Re: [Gluster-users] [Gluster-devel] Lot of EIO errors in disperse volume

2017-01-20 Thread Xavier Hernandez

On 20/01/17 08:55, Ankireddypalle Reddy wrote:

Xavi,
   Thanks.  Please let me know the functions that we need to track for 
any inconsistencies in the return codes from multiple bricks for issue 1. I 
will start doing that.

  1. Why the write fails in the first place


The best way would be to see the logs. Related functions already log 
messages when this happens.


In ec_check_status() there's a message logged when something has failed, 
but before that there should also be some error messages indicating the 
reason for the failure.


Please, note that some of the errors logged by ec_check_status() are not 
real problems. See patch http://review.gluster.org/16435/ for more info.


Xavi



Thanks and Regards,
Ram

-Original Message-
From: Xavier Hernandez [mailto:xhernan...@datalab.es]
Sent: Friday, January 20, 2017 2:41 AM
To: Ankireddypalle Reddy; Ashish Pandey
Cc: gluster-users@gluster.org; Gluster Devel (gluster-de...@gluster.org)
Subject: Re: [Gluster-users] [Gluster-devel] Lot of EIO errors in disperse 
volume

Hi Ram,

On 20/01/17 08:02, Ankireddypalle Reddy wrote:

Ashish,

   Thanks for looking into the issue. In the given
example the size/version matches for the file on glusterfs4 and glusterfs5
nodes. The file is empty on glusterfs6. Now what happens if glusterfs5
goes down? Though the SLA factor of 2 is met, I will still not be able
to access the data.


True, but having a brick with inconsistent data is the same as having it
down. You would have lost 2 bricks out of 3.

The problem is how to detect the inconsistent data, what is causing it
and why self-heal (apparently) is not healing it.


The problem is that writes did not fail for the file
to indicate the issue to the application.


That's the expected behavior. Since we have redundancy 1, a loss or
failure of a single fragment is hidden from the application. However,
this triggers internal procedures to repair the problem in the
background. This also seems not to be working.
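For reference, the quorum arithmetic behind this can be sketched as follows (a minimal sketch; the 2+1 numbers come from the volume info below):

```shell
# Disperse 2+1: 2 data fragments + 1 redundancy fragment per file.
data=2; redundancy=1
bricks=$((data + redundancy))   # 3 bricks per disperse set
min_ok=$data                    # fragments that must succeed
echo "need $min_ok of $bricks bricks to answer; tolerate $redundancy failure(s)"
# One bad fragment is hidden from the application, but a second failure
# in the same set (an outage plus an inconsistent brick) exceeds the
# redundancy and reads start returning EIO.
```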

There are two issues to identify here:

1. Why the write fails in the first place
2. Why self-heal is unable to repair it

Probably the root cause is the same for both problems, but I'm not sure.

For 1 there should be some warning or error in the mount log, since one
of the bricks is reporting an error, even if ec is able to report
success to the application later.

For 2 the analysis could be more complex, but most probably there should
be some warning or error message in the mount log and/or self-heal log
of one of the servers.
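As a starting point, something like the following can pull those messages out (the log file names are assumptions; FUSE mount logs are usually named after the mount point, e.g. /ws/glus becomes ws-glus.log):

```shell
# Scan the FUSE mount log and the self-heal daemon log for warnings and
# errors from the disperse (ec) translator. Adjust paths to your setup.
for LOG in /var/log/glusterfs/ws-glus.log /var/log/glusterfs/glustershd.log; do
  [ -r "$LOG" ] || continue
  echo "== $LOG =="
  grep -E '\] [WE] \[' "$LOG" | grep -i 'disperse\|ec-' | tail -n 50
done
```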

Xavi


It’s not that we are encountering the issue for every file on the mount
point; the issue happens randomly for different files.



[root@glusterfs4 glusterfs]# gluster volume info



Volume Name: glusterfsProd

Type: Distributed-Disperse

Volume ID: 622aa3ee-958f-485f-b3c1-fb0f6c8db34c

Status: Started

Number of Bricks: 12 x (2 + 1) = 36

Transport-type: tcp

Bricks:

Brick1: glusterfs4sds:/ws/disk1/ws_brick

Brick2: glusterfs5sds:/ws/disk1/ws_brick

Brick3: glusterfs6sds:/ws/disk1/ws_brick

Brick4: glusterfs4sds:/ws/disk10/ws_brick

Brick5: glusterfs5sds:/ws/disk10/ws_brick

Brick6: glusterfs6sds:/ws/disk10/ws_brick

Brick7: glusterfs4sds:/ws/disk11/ws_brick

Brick8: glusterfs5sds:/ws/disk11/ws_brick

Brick9: glusterfs6sds:/ws/disk11/ws_brick

Brick10: glusterfs4sds:/ws/disk2/ws_brick

Brick11: glusterfs5sds:/ws/disk2/ws_brick

Brick12: glusterfs6sds:/ws/disk2/ws_brick

Brick13: glusterfs4sds:/ws/disk3/ws_brick

Brick14: glusterfs5sds:/ws/disk3/ws_brick

Brick15: glusterfs6sds:/ws/disk3/ws_brick

Brick16: glusterfs4sds:/ws/disk4/ws_brick

Brick17: glusterfs5sds:/ws/disk4/ws_brick

Brick18: glusterfs6sds:/ws/disk4/ws_brick

Brick19: glusterfs4sds:/ws/disk5/ws_brick

Brick20: glusterfs5sds:/ws/disk5/ws_brick

Brick21: glusterfs6sds:/ws/disk5/ws_brick

Brick22: glusterfs4sds:/ws/disk6/ws_brick

Brick23: glusterfs5sds:/ws/disk6/ws_brick

Brick24: glusterfs6sds:/ws/disk6/ws_brick

Brick25: glusterfs4sds:/ws/disk7/ws_brick

Brick26: glusterfs5sds:/ws/disk7/ws_brick

Brick27: glusterfs6sds:/ws/disk7/ws_brick

Brick28: glusterfs4sds:/ws/disk8/ws_brick

Brick29: glusterfs5sds:/ws/disk8/ws_brick

Brick30: glusterfs6sds:/ws/disk8/ws_brick

Brick31: glusterfs4sds:/ws/disk9/ws_brick

Brick32: glusterfs5sds:/ws/disk9/ws_brick

Brick33: glusterfs6sds:/ws/disk9/ws_brick

Brick34: glusterfs4sds:/ws/disk12/ws_brick

Brick35: glusterfs5sds:/ws/disk12/ws_brick

Brick36: glusterfs6sds:/ws/disk12/ws_brick

Options Reconfigured:

storage.build-pgfid: on

performance.readdir-ahead: on

nfs.export-dirs: off

nfs.export-volumes: off

nfs.disable: on

auth.allow: glusterfs4sds,glusterfs5sds,glusterfs6sds

diagnostics.client-log-level: INFO

[root@glusterfs4 glusterfs]#



Thanks and Regards,

Ram

*From:*Ashish Pandey [mailto:aspan...@redhat.com]
*Sent:* Thursday, January 19, 2017 10:36 PM
*To:* Ankireddypalle Reddy
*Cc:* Xavier Hernandez; gluster-users@gluster.org; Gluster Devel
(gluster-de...@gluster.org)
*Subject:* Re: [Gluster


Re: [Gluster-users] [Gluster-devel] Lot of EIO errors in disperse volume

2017-01-19 Thread Xavier Hernandez


*From:*Ashish Pandey [mailto:aspan...@redhat.com]
*Sent:* Thursday, January 19, 2017 10:36 PM
*To:* Ankireddypalle Reddy
*Cc:* Xavier Hernandez; gluster-users@gluster.org; Gluster Devel
(gluster-de...@gluster.org)
*Subject:* Re: [Gluster-users] [Gluster-devel] Lot of EIO errors in
disperse volume



Ram,



I don't understand what you mean by saying "redundancy factor of 2 is
met in a 3:1 disperse volume". You have given the xattrs of only 3 bricks.

The two sentences above and the getxattr output contradict each other.

In the given scenario, if you have a (2+1) ec configuration and 2 bricks
have the same size and version, then there should not be any problem
accessing this file. Run heal and the 3rd fragment will also be healthy.
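The heal step can be run from the CLI on any server node. A minimal sketch ('glusterfsProd' is the volume name from this thread; the commands are only attempted if the gluster CLI is installed):

```shell
# Trigger index heal and then list what still needs healing.
if command -v gluster >/dev/null 2>&1; then
  gluster volume heal glusterfsProd        # heal entries from the index
  gluster volume heal glusterfsProd info   # entries still pending heal
else
  echo "gluster CLI not found; run these on one of the server nodes"
fi
```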



I think there has been a major gap in providing complete and correct
information about the volume, the logs and the activities.
Could you please provide the following:

1 - gluster v info - please give us the output of this command.

2 - Let's consider only one file which you are not able to access and
find out the reason.

3 - Try to create and write some files on the mount point and see if
there is any issue with new file creation; if yes, why? Provide logs.
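A minimal sketch for item 3, assuming the /ws/glus mount point and the brick/host layout mentioned elsewhere in this thread (the file name and dd sizes are arbitrary, and the commands are guarded so nothing runs if the mount is absent):

```shell
# Write a test file through the mount, then compare the ec xattrs of its
# fragments on each brick host.
MOUNT=/ws/glus
name="ec-write-test.$$"
if [ -d "$MOUNT" ]; then
  dd if=/dev/urandom of="$MOUNT/$name" bs=1M count=4 conv=fsync
  for h in glusterfs4sds glusterfs5sds glusterfs6sds; do
    echo "== $h =="
    # DHT places the file on one disperse set, hence the disk* glob.
    ssh "$h" "getfattr -d -m trusted.ec -e hex /ws/disk*/ws_brick/$name 2>/dev/null"
  done
else
  echo "mount $MOUNT not present; adjust MOUNT for your setup"
fi
```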



Specific but enough information is required to fin

Re: [Gluster-users] [Gluster-devel] Lot of EIO errors in disperse volume

2017-01-19 Thread Ankireddypalle Reddy
Xavi,
Finally I am able to locate the files for which we are seeing the 
mismatch.

[2017-01-19 21:20:10.002737] W [MSGID: 122056] 
[ec-combine.c:875:ec_combine_check] 0-glusterfsProd-disperse-0: Mismatching 
xdata in answers of 'LOOKUP' for f5e61224-17f6-4803-9f75-dd74cd79e248
[2017-01-19 21:20:10.002776] W [MSGID: 122006] 
[ec-combine.c:207:ec_iatt_combine] 0-glusterfsProd-disperse-0: Failed to 
combine iatt (inode: 11490333518038950472-11490333518038950472, links: 1-1, 
uid: 0-0, gid: 0-0, rdev: 0-0, size: 4800512-4734976, mode: 100775-100775) for 
f5e61224-17f6-4803-9f75-dd74cd79e248

GFID f5e61224-17f6-4803-9f75-dd74cd79e248 maps to file 
/ws/disk1/ws_brick/Folder_01.05.2017_21.15/CV_MAGNETIC/V_32410/CHUNK_424795/SFILE_CONTAINER_158

The size field differs between the bricks of this sub volume.

[root@glusterfs4 glusterfs]# getfattr -d -e hex -m . 
/ws/disk1/ws_brick/Folder_01.05.2017_21.15/CV_MAGNETIC/V_32410/CHUNK_424795/SFILE_CONTAINER_158
trusted.ec.size=0x000e
trusted.ec.version=0x00080008

[root@glusterfs5 ws]# getfattr -d -e hex -m . 
/ws/disk1/ws_brick/Folder_01.05.2017_21.15/CV_MAGNETIC/V_32410/CHUNK_424795/SFILE_CONTAINER_158
trusted.ec.size=0x000e
trusted.ec.version=0x00080008

[root@glusterfs6 glusterfs]# getfattr -d -e hex -m . 
/ws/disk1/ws_brick/Folder_01.05.2017_21.15/CV_MAGNETIC/V_32410/CHUNK_424795/SFILE_CONTAINER_158
getfattr: Removing leading '/' from absolute path names
trusted.ec.size=0x
trusted.ec.version=0x0008

On glusterfs6 writes seem to have failed and the file appears to be an empty 
file. 
[root@glusterfs6 glusterfs]# stat 
/ws/disk1/ws_brick/Folder_01.05.2017_21.15/CV_MAGNETIC/V_32410/CHUNK_424795/SFILE_CONTAINER_158
  File: 
'/ws/disk1/ws_brick/Folder_01.05.2017_21.15/CV_MAGNETIC/V_32410/CHUNK_424795/SFILE_CONTAINER_158'
  Size: 0   Blocks: 0  IO Block: 4096   regular empty file
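For reference, the xattr values above are truncated in the archive, but in general trusted.ec.size is a big-endian 64-bit byte count and trusted.ec.version packs two 64-bit counters (data version, metadata version). A sketch with made-up full-width values:

```shell
# Decode ec xattrs (illustrative values, not the truncated ones above).
size_hex=00000000004a3800                  # hypothetical trusted.ec.size
ver_hex=0000000000002a380000000000002a38   # hypothetical trusted.ec.version
echo "size: $((16#$size_hex)) bytes"
echo "data version: $((16#${ver_hex:0:16}))  metadata version: $((16#${ver_hex:16:16}))"
```

Comparing these decoded numbers across the bricks of one disperse set is a quick way to spot the inconsistent fragment.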

This looks like a major issue. We are left with no redundancy for these 
files: if the brick on the other node fails, we will not be able to access 
this data even though the redundancy factor of 2 is met in a 3:1 disperse volume. 

Thanks and Regards,
Ram

-Original Message-
From: gluster-users-boun...@gluster.org 
[mailto:gluster-users-boun...@gluster.org] On Behalf Of Ankireddypalle Reddy
Sent: Monday, January 16, 2017 9:41 AM
To: Xavier Hernandez
Cc: gluster-users@gluster.org; Gluster Devel (gluster-de...@gluster.org)
Subject: Re: [Gluster-users] [Gluster-devel] Lot of EIO errors in disperse 
volume

Xavi,
   Thanks. I will start by tracking the delta, if any, in return codes 
from the bricks for writes in EC. I have a feeling that if we could get to the 
bottom of this issue then the EIO errors could ultimately be avoided. Please 
note that while we are testing all this on a standby setup, we continue to 
encounter the EIO errors on our production setup, which is running 3.7.18.

Thanks and Regards,
Ram

-Original Message-
From: Xavier Hernandez [mailto:xhernan...@datalab.es] 
Sent: Monday, January 16, 2017 6:54 AM
To: Ankireddypalle Reddy
Cc: gluster-users@gluster.org; Gluster Devel (gluster-de...@gluster.org)
Subject: Re: [Gluster-devel] [Gluster-users] Lot of EIO errors in disperse 
volume

Hi Ram,

On 16/01/17 12:33, Ankireddypalle Reddy wrote:
> Xavi,
>   Thanks. Is there any other way to map from GFID to path.

The only way I know is to search all files from bricks and lookup for 
the trusted.gfid xattr.
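Besides a full xattr scan, each brick also keeps a hard link for every GFID under .glusterfs, which gives a cheaper starting point. A sketch using a brick path and GFID that appear in this thread:

```shell
# A GFID maps to <brick>/.glusterfs/<first2>/<next2>/<gfid>; for regular
# files this is a hard link, so the real path shares the same inode.
BRICK=/ws/disk1/ws_brick
GFID=60b990ed-d741-4176-9c7b-4d3a25fb8252
gfile="$BRICK/.glusterfs/${GFID:0:2}/${GFID:2:2}/$GFID"
echo "$gfile"
# On the brick itself (may be slow on large bricks):
# find "$BRICK" -samefile "$gfile" -not -path '*/.glusterfs/*'
```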

> I will look for a way to share the TRACE logs. An easier way might be to
> add some extra logging. I could do that if you could let me know the
> functions in which you are interested.

The problem is that I don't know where the problem is. One possibility 
could be to track all return values from all bricks for all writes and 
then identify which ones belong to an inconsistent file.

But if this doesn't reveal anything interesting we'll need to look at 
some other place. And this can be very tedious and slow.

Anyway, what we are looking now is not the source of an EIO, since there 
are two bricks with consistent state and the file should be perfectly 
readable and writable. It's true that there's some problem here and it 
could derive in EIO if one of the healthy bricks degrades, but at least 
this file shouldn't be giving EIO errors for now.

Xavi


Re: [Gluster-users] [Gluster-devel] Lot of EIO errors in disperse volume

2017-01-16 Thread Ankireddypalle Reddy

>
> Sent from my iPhone
>
>> On Jan 16, 2017, at 6:23 AM, Xavier Hernandez <xhernan...@datalab.es> wrote:
>>
>> Hi Ram,
>>
>>> On 13/01/17 18:41, Ankireddypalle Reddy wrote:
>>> Xavi,
>>> I enabled TRACE logging. The log grew up to 120GB and could not 
>>> make much out of it. Then I started logging GFID in the code where we were 
>>> seeing errors.
>>>
>>> [2017-01-13 17:02:01.761349] I [dict.c:3065:dict_dump_to_log] 
>>> 0-glusterfsProd-disperse-0: dict=0x7fa6706bc690 
>>> ((trusted.ec.size:0:0:0:0:30:6b:0:0:)(trusted.ec.version:0:0:0:0:0:0:2a:38:0:0:0:0:0:0:2a:38:))
>>> [2017-01-13 17:02:01.761360] I [dict.c:3065:dict_dump_to_log] 
>>> 0-glusterfsProd-disperse-0: dict=0x7fa6706bed64 
>>> ((trusted.ec.size:0:0:0:0:0:0:0:0:)(trusted.ec.version:0:0:0:0:0:0:0:0:0:0:0:0:0:0:2a:38:))
>>> [2017-01-13 17:02:01.761365] W [MSGID: 122056] 
>>> [ec-combine.c:881:ec_combine_check] 0-glusterfsProd-disperse-0: Mismatching 
>>> xdata in answers of 'LOOKUP'
>>> [2017-01-13 17:02:01.761405] I [dict.c:166:key_value_cmp] 
>>> 0-glusterfsProd-disperse-0: 'trusted.ec.size' is different in two dicts (8, 
>>> 8)
>>> [2017-01-13 17:02:01.761417] I [dict.c:3065:dict_dump_to_log] 
>>> 0-glusterfsProd-disperse-0: dict=0x7fa6706bbb14 
>>> ((trusted.ec.size:0:0:0:0:30:6b:0:0:)(trusted.ec.version:0:0:0:0:0:0:2a:38:0:0:0:0:0:0:2a:38:))
>>> [2017-01-13 17:02:01.761428] I [dict.c:3065:dict_dump_to_log] 
>>> 0-glusterfsProd-disperse-0: dict=0x7fa6706bed64 
>>> ((trusted.ec.size:0:0:0:0:0:0:0:0:)(trusted.ec.version:0:0:0:0:0:0:0:0:0:0:0:0:0:0:2a:38:))
>>> [2017-01-13 17:02:01.761433] W [MSGID: 122056] 
>>> [ec-combine.c:881:ec_combine_check] 0-glusterfsProd-disperse-0: Mismatching 
>>> xdata in answers of 'LOOKUP'
>>> [2017-01-13 17:02:01.761442] W [MSGID: 122006] 
>>> [ec-combine.c:214:ec_iatt_combine] 0-glusterfsProd-disperse-0: Failed to 
>>> combine iatt (inode: 11275691004192850514-11275691004192850514, gfid: 
>>> 60b990ed-d741-4176-9c7b-4d3a25fb8252  -  
>>> 60b990ed-d741-4176-9c7b-4d3a25fb8252,  links: 1-1, uid: 0-0, gid: 0-0, 
>>> rdev: 0-0,size: 406650880-406683648, mode: 100775-100775)
>>>
>>> The file for which we are seeing this error turns out to be having a GFID 
>>> of 60b990ed-d741-4176-9c7b-4d3a25fb8252
>>>
>>> Then I tried to find the file with this GFID. It pointed me to the 
>>> following path. I was expecting a real file system path, per the following 
>>> tutorial:
>>> https://gluster.readthedocs.io/en/latest/Troubleshooting/gfid-to-path/
>>
>> I think this method only works if bricks have the inode cached.
>>
>>>
>>> getfattr -n trusted.glusterfs.pathinfo -e text 
>>> /mnt/gfid/.gfid/60b990ed-d741-4176-9c7b-4d3a25fb8252
>>> getfattr: Remov
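As an aside, the dict_dump_to_log lines quoted above print each xattr as colon-separated hex bytes. A small bash sketch to turn a trusted.ec.version dump back into its two counters (assuming the usual pair of 64-bit values):

```shell
# Decode a dict_dump_to_log byte string such as
# "0:0:0:0:0:0:2a:38:0:0:0:0:0:0:2a:38" (trusted.ec.version) into the
# two 64-bit counters it encodes (data version, metadata version).
decode_ec_version() {
  local IFS=':' hex='' b
  for b in $1; do
    hex=$(printf '%s%02x' "$hex" "$((16#$b))")  # pad each byte to 2 digits
  done
  echo "data=$((16#${hex:0:16})) metadata=$((16#${hex:16:16}))"
}
decode_ec_version "0:0:0:0:0:0:2a:38:0:0:0:0:0:0:2a:38"
# -> data=10808 metadata=10808
```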

Re: [Gluster-users] [Gluster-devel] Lot of EIO errors in disperse volume

2017-01-16 Thread Xavier Hernandez
0306b
trusted.ec.version=0x2a382a38
trusted.gfid=0x60b990edd74141769c7b4d3a25fb8252

[root@glusterfs6 ee]# getfattr -d -m . -e hex 
/ws/disk1/ws_brick/.glusterfs/60/b9/60b990ed-d741-4176-9c7b-4d3a25fb8252
getfattr: Removing leading '/' from absolute path names
# file: ws/disk1/ws_brick/.glusterfs/60/b9/60b990ed-d741-4176-9c7b-4d3a25fb8252
trusted.bit-rot.version=0x02005877a8dc000c9436
trusted.ec.config=0x080301000200
trusted.ec.dirty=0x0016
trusted.ec.size=0x306b
trusted.ec.version=0x2a382a38
trusted.gfid=0x60b990edd74141769c7b4d3a25fb8252

It turns out that the size and version in fact do not match for one of the 
files.


It seems as if the brick on glusterfs4 didn't receive any write requests (or 
they failed for some reason). Do you still have the trace log? Is there any 
way I could download it?

Xavi



Thanks and Regards,
Ram

-Original Message-
From: gluster-devel-boun...@gluster.org 
[mailto:gluster-devel-boun...@gluster.org] On Behalf Of Ankireddypalle Reddy
Sent: Friday, January 13, 2017 4:17 AM
To: Xavier Hernandez
Cc: gluster-users@gluster.org; Gluster Devel (gluster-de...@gluster.org)
Subject: Re: [Gluster-devel] [Gluster-users] Lot of EIO errors in disperse 
volume

Xavi,
   Thanks for explanation. Will collect TRACE logs today.

Thanks and Regards,
Ram

Sent from my iPhone


On Jan 13, 2017, at 3:03 AM, Xavier Hernandez <xhernan...@datalab.es> wrote:

Hi Ram,


On 12/01/17 22:14, Ankireddypalle Reddy wrote:
Xavi,
   I changed the logging to log the individual bytes. Consider the 
following from ws-glus.log file where /ws/glus is the mount point.

[2017-01-12 20:47:59.368102] I [MSGID: 109063]
[dht-layout.c:718:dht_layout_normalize] 0-glusterfsProd-dht: Found
anomalies in
/Folder_01.05.2017_21.15/CV_MAGNETIC/V_30970/CHUNK_390607 (gfid =
e694387f-dde7-410b-9562-914a994d5e85). Holes=1 overlaps=0
[2017-01-12 20:47:59.391218] I [MSGID: 109036]
[dht-common.c:9082:dht_log_new_layout_for_dir_selfheal]
0-glusterfsProd-dht: Setting layout of
/Folder_01.05.2017_21.15/CV_MAGNETIC/V_30970/CHUNK_390607 with
[Subvol_name: glusterfsProd-disperse-0, Err: -1 , Start: 2505397587 ,
Stop: 2863311527 , Hash: 1 ], [Subvol_name: glusterfsProd-disperse-1,
Err: -1 , Start: 2863311528 , Stop: 3221225468 , Hash: 1 ],
[Subvol_name: glusterfsProd-disperse-10, Err: -1 , Start: 3221225469
, Stop: 3579139409 , Hash: 1 ], [Subvol_name:
glusterfsProd-disperse-11, Err: -1 , Start: 3579139410 , Stop:
3937053350 , Hash: 1 ], [Subvol_name: glusterfsProd-disperse-2, Err:
-1 , Start: 3937053351 , Stop: 4294967295 , Hash: 1 ], [Subvol_name:
glusterfsProd-disperse-3, Err: -1 , Start: 0 , Stop: 357913940 ,
Hash: 1 ], [Subvol_name: glusterfsProd-disperse-4, Err: -1 , Start:
357913941 , Stop: 715827881 , Hash: 1 ], [Subvol_name:
glusterfsProd-disperse-5, Err: -1 , Start: 715827882 , Stop:
1073741822 , Hash

: 1 ], [Subvol_name: glusterfsProd-disperse-6, Err: -1 , Start: 1073741823 , 
Stop: 1431655763 , Hash: 1 ], [Subvol_name: glusterfsProd-disperse-7, Err: -1 , 
Start: 1431655764 , Stop: 1789569704 , Hash: 1 ], [Subvol_name: 
glusterfsProd-disperse-8, Err: -1 , Start: 1789569705 , Stop: 2147483645 , 
Hash: 1 ], [Subvol_name: glusterfsProd-disperse-9, Err: -1 , Start: 2147483646 
, Stop: 2505397586 , Hash: 1 ],


  Self-heal seems to be triggered for path 
/Folder_01.05.2017_21.15/CV_MAGNETIC/V_30970/CHUNK_390607 due to anomalies as 
per DHT. It would be great if someone could explain what the anomaly could be 
here. The setup where we encountered this is fairly stable, with no brick or 
node failures.


This is not really a self-heal, at least from the point of view of ec. It 
means that DHT has found a discrepancy in the layout of that directory; 
however, this doesn't indicate any problem (notice the 'I' in the log, meaning 
that it's informative, not a warning nor an error).

Not sure how DHT works in this case or why it finds this "anomaly", but if 
there aren't any errors before that message, it can be completely ignored.

Not sure if it can be related to option cluster.weighted-rebalance that is 
enabled by default.



 Then self-heal seems to have encountered the following error.

[2017-01-12 20:48:23.418432] I [dict.c:166:key_value_cmp]
0-glusterfsProd-disperse-2: 'trusted.ec.version' is different in two
dicts (16, 16)
[2017-01-12 20:48:23.418496] I [dict.c:3065:dict_dump_to_log]
0-glusterfsProd-disperse-2: dict=0x7f0b649520ac
((trusted.glusterfs.dht:0:0:0:1:0:0:0:0:0:0:0:0:15:55:55:54:)(trusted
.ec.version:0:0:0:0:0:0:0:b:0:0:0:0:0:0:0:e:))
[2017-01-12 20:48:23.418519] I [dict.c:3065:dict_dump_to_log]
0-glusterfsProd-disperse-2: dict=0x7f0b6495b4e0
((trusted.glusterfs.dht:0:0:0:1:0:0:0:0:0:0:0:0:15:55:55:54:)(trusted
.ec.version:0:0:0:0:0:0:0:d:0:0:0:0:0:0:0:e:))
[2017-01-12 20:48:23.418531] W [MSGID: 122056] 
[ec-combine.c:873:ec_combine_check] 0-

Re: [Gluster-users] [Gluster-devel] Lot of EIO errors in disperse volume

2017-01-16 Thread Ankireddypalle Reddy
4d3a25fb8252
>> 
>> [root@glusterfs6 ee]# getfattr -d -m . -e hex 
>> /ws/disk1/ws_brick/.glusterfs/60/b9/60b990ed-d741-4176-9c7b-4d3a25fb8252
>> getfattr: Removing leading '/' from absolute path names
>> # file: 
>> ws/disk1/ws_brick/.glusterfs/60/b9/60b990ed-d741-4176-9c7b-4d3a25fb8252
>> trusted.bit-rot.version=0x02005877a8dc000c9436
>> trusted.ec.config=0x080301000200
>> trusted.ec.dirty=0x0016
>> trusted.ec.size=0x306b
>> trusted.ec.version=0x2a382a38
>> trusted.gfid=0x60b990edd74141769c7b4d3a25fb8252
>> 
>> It turns out that the size and version in fact does not match for one of the 
>> files.
> 
> It seems as if the brick on glusterfs4 didn't receive any write request (or 
> they failed for some reason). Do you still have the trace log ? is there any 
> way I could download it ?
> 
> Xavi
> 
>> 
>> Thanks and Regards,
>> Ram
>> 
>> -Original Message-
>> From: gluster-devel-boun...@gluster.org 
>> [mailto:gluster-devel-boun...@gluster.org] On Behalf Of Ankireddypalle Reddy
>> Sent: Friday, January 13, 2017 4:17 AM
>> To: Xavier Hernandez
>> Cc: gluster-users@gluster.org; Gluster Devel (gluster-de...@gluster.org)
>> Subject: Re: [Gluster-devel] [Gluster-users] Lot of EIO errors in disperse 
>> volume
>> 
>> Xavi,
>>Thanks for explanation. Will collect TRACE logs today.
>> 
>> Thanks and Regards,
>> Ram
>> 
>> Sent from my iPhone
>> 

Re: [Gluster-users] [Gluster-devel] Lot of EIO errors in disperse volume

2017-01-16 Thread Xavier Hernandez
ster Devel (gluster-de...@gluster.org)
Subject: Re: [Gluster-devel] [Gluster-users] Lot of EIO errors in disperse 
volume

Xavi,
Thanks for explanation. Will collect TRACE logs today.

Thanks and Regards,
Ram

Sent from my iPhone


On Jan 13, 2017, at 3:03 AM, Xavier Hernandez <xhernan...@datalab.es> wrote:

Hi Ram,


On 12/01/17 22:14, Ankireddypalle Reddy wrote:
Xavi,
I changed the logging to log the individual bytes. Consider the 
following from ws-glus.log file where /ws/glus is the mount point.

[2017-01-12 20:47:59.368102] I [MSGID: 109063]
[dht-layout.c:718:dht_layout_normalize] 0-glusterfsProd-dht: Found
anomalies in
/Folder_01.05.2017_21.15/CV_MAGNETIC/V_30970/CHUNK_390607 (gfid =
e694387f-dde7-410b-9562-914a994d5e85). Holes=1 overlaps=0
[2017-01-12 20:47:59.391218] I [MSGID: 109036]
[dht-common.c:9082:dht_log_new_layout_for_dir_selfheal]
0-glusterfsProd-dht: Setting layout of
/Folder_01.05.2017_21.15/CV_MAGNETIC/V_30970/CHUNK_390607 with
[Subvol_name: glusterfsProd-disperse-0, Err: -1 , Start: 2505397587 ,
Stop: 2863311527 , Hash: 1 ], [Subvol_name: glusterfsProd-disperse-1,
Err: -1 , Start: 2863311528 , Stop: 3221225468 , Hash: 1 ],
[Subvol_name: glusterfsProd-disperse-10, Err: -1 , Start: 3221225469
, Stop: 3579139409 , Hash: 1 ], [Subvol_name:
glusterfsProd-disperse-11, Err: -1 , Start: 3579139410 , Stop:
3937053350 , Hash: 1 ], [Subvol_name: glusterfsProd-disperse-2, Err:
-1 , Start: 3937053351 , Stop: 4294967295 , Hash: 1 ], [Subvol_name:
glusterfsProd-disperse-3, Err: -1 , Start: 0 , Stop: 357913940 ,
Hash: 1 ], [Subvol_name: glusterfsProd-disperse-4, Err: -1 , Start:
357913941 , Stop: 715827881 , Hash: 1 ], [Subvol_name:
glusterfsProd-disperse-5, Err: -1 , Start: 715827882 , Stop:
1073741822 , Hash: 1 ], [Subvol_name: glusterfsProd-disperse-6, Err: -1 , Start: 1073741823 ,
Stop: 1431655763 , Hash: 1 ], [Subvol_name: glusterfsProd-disperse-7, Err: -1 , 
Start: 1431655764 , Stop: 1789569704 , Hash: 1 ], [Subvol_name: 
glusterfsProd-disperse-8, Err: -1 , Start: 1789569705 , Stop: 2147483645 , 
Hash: 1 ], [Subvol_name: glusterfsProd-disperse-9, Err: -1 , Start: 2147483646 
, Stop: 2505397586 , Hash: 1 ],


   Self-heal seems to be triggered for path 
/Folder_01.05.2017_21.15/CV_MAGNETIC/V_30970/CHUNK_390607 due to anomalies as 
per DHT. It would be great if someone could explain what the anomaly is here. 
The setup where we encountered this is fairly stable, with no brick or node 
failures.


This is not really a self-heal, at least from the point of view of ec. This 
means that DHT has found a discrepancy in the layout of that directory, however 
this doesn't mean any problem (notice the 'I' in the log, meaning that it's 
informative, not a warning nor error).

Not sure how DHT works in this case or why it finds this "anomaly", but if 
there aren't any previous errors before that message, it can be completely ignored.

Not sure if it can be related to option cluster.weighted-rebalance that is 
enabled by default.
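For what it's worth, the hole/overlap check behind the "Holes=1 overlaps=0" message can be reproduced from the Start/Stop pairs in the "Setting layout" line above. This is a rough sketch of such a check (not Gluster's actual dht_layout_normalize() code), applied to the repaired layout, which covers the whole 32-bit hash space with no gaps:

```python
# Start/Stop pairs taken from the "Setting layout" log line above.
spans = [
    (2505397587, 2863311527),  # disperse-0
    (2863311528, 3221225468),  # disperse-1
    (3221225469, 3579139409),  # disperse-10
    (3579139410, 3937053350),  # disperse-11
    (3937053351, 4294967295),  # disperse-2
    (0, 357913940),            # disperse-3
    (357913941, 715827881),    # disperse-4
    (715827882, 1073741822),   # disperse-5
    (1073741823, 1431655763),  # disperse-6
    (1431655764, 1789569704),  # disperse-7
    (1789569705, 2147483645),  # disperse-8
    (2147483646, 2505397586),  # disperse-9
]

def layout_anomalies(spans):
    """Count holes and overlaps in hash ranges that should cover the
    full 32-bit space exactly once."""
    holes = overlaps = 0
    expected = 0
    for start, stop in sorted(spans):
        if start > expected:
            holes += 1        # gap before this range
        elif start < expected:
            overlaps += 1     # this range starts inside the previous one
        expected = max(expected, stop + 1)
    if expected <= 0xFFFFFFFF:
        holes += 1            # ranges stop short of the end of the space
    return holes, overlaps

print(layout_anomalies(spans))  # the repaired layout is contiguous: (0, 0)
```

A layout missing one of these ranges would report Holes=1, which is what the message above logged for the old layout before the repair.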



  Then Self-heal seems to have encountered the following error.

[2017-01-12 20:48:23.418432] I [dict.c:166:key_value_cmp]
0-glusterfsProd-disperse-2: 'trusted.ec.version' is different in two
dicts (16, 16)
[2017-01-12 20:48:23.418496] I [dict.c:3065:dict_dump_to_log]
0-glusterfsProd-disperse-2: dict=0x7f0b649520ac
((trusted.glusterfs.dht:0:0:0:1:0:0:0:0:0:0:0:0:15:55:55:54:)(trusted.ec.version:0:0:0:0:0:0:0:b:0:0:0:0:0:0:0:e:))
[2017-01-12 20:48:23.418519] I [dict.c:3065:dict_dump_to_log]
0-glusterfsProd-disperse-2: dict=0x7f0b6495b4e0
((trusted.glusterfs.dht:0:0:0:1:0:0:0:0:0:0:0:0:15:55:55:54:)(trusted.ec.version:0:0:0:0:0:0:0:d:0:0:0:0:0:0:0:e:))
[2017-01-12 20:48:23.418531] W [MSGID: 122056] 
[ec-combine.c:873:ec_combine_check] 0-glusterfsProd-disperse-2: Mismatching 
xdata in answers of 'LOOKUP'


That's a real problem. Here we have two bricks that differ in the 
trusted.ec.version xattr. However, this xattr does not necessarily belong to 
the previous directory; they are unrelated messages.
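For reference, trusted.ec.version holds two big-endian 64-bit counters, the data version followed by the metadata version (consistent with Xavier's later remark that the 'data' version differs between bricks while the metadata version matches). A small sketch to decode the byte strings printed by dict_dump_to_log; the function name is just for illustration:

```python
def decode_ec_version(dump):
    """Decode a dict_dump_to_log byte string such as
    '0:0:0:0:0:0:0:b:0:0:0:0:0:0:0:e' into (data_version, meta_version)."""
    raw = bytes(int(b, 16) for b in dump.strip(":").split(":"))
    return int.from_bytes(raw[:8], "big"), int.from_bytes(raw[8:], "big")

# The two answers logged for glusterfsProd-disperse-2 above:
print(decode_ec_version("0:0:0:0:0:0:0:b:0:0:0:0:0:0:0:e"))  # (11, 14)
print(decode_ec_version("0:0:0:0:0:0:0:d:0:0:0:0:0:0:0:e"))  # (13, 14)
```

So the two bricks agree on the metadata version (14) but disagree on the data version (11 vs 13), which is exactly the kind of discrepancy that triggers ec's mismatch warning.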



In this case glusterfsProd-disperse-2 sub volume actually consists 
of the following bricks.
glusterfs4sds:/ws/disk11/ws_brick, glusterfs5sds:
/ws/disk11/ws_brick, glusterfs6sds: /ws/disk11/ws_brick

I went ahead and checked the value of trusted.ec.version on all the 
3 bricks inside this sub vol:

[root@glusterfs6 ~]# getfattr -e hex -n trusted.ec.version 
/ws/disk11/ws_brick//Folder_01.05.2017_21.15/CV_MAGNETIC/V_30970/CHUNK_390607
# file: 
ws/disk11/ws_brick//Folder_01.05.2017_21.15/CV_MAGNETIC/V_30970/CHUNK_390607
trusted.ec.version=0x0000000000000009000000000000000b

[root@glusterfs4 ~]# getfattr -e hex -n trusted.ec.version 
/ws/disk11/ws_brick//Folder_01.05.2017_21.15/CV_MAGNETIC/V_30970/CHUNK_390607
# file: 
ws/disk11/ws_brick//Folder_01.05.2017_21.15/CV_MAGNETIC/V_30970/CHUNK_390607
 trusted.ec.vers

Re: [Gluster-users] [Gluster-devel] Lot of EIO errors in disperse volume

2017-01-13 Thread Ankireddypalle Reddy
Xavi,
 I enabled TRACE logging. The log grew to 120GB and I could not 
make much out of it. Then I started logging the GFID in the code where we were 
seeing errors.

[2017-01-13 17:02:01.761349] I [dict.c:3065:dict_dump_to_log] 
0-glusterfsProd-disperse-0: dict=0x7fa6706bc690 
((trusted.ec.size:0:0:0:0:30:6b:0:0:)(trusted.ec.version:0:0:0:0:0:0:2a:38:0:0:0:0:0:0:2a:38:))
[2017-01-13 17:02:01.761360] I [dict.c:3065:dict_dump_to_log] 
0-glusterfsProd-disperse-0: dict=0x7fa6706bed64 
((trusted.ec.size:0:0:0:0:0:0:0:0:)(trusted.ec.version:0:0:0:0:0:0:0:0:0:0:0:0:0:0:2a:38:))
[2017-01-13 17:02:01.761365] W [MSGID: 122056] 
[ec-combine.c:881:ec_combine_check] 0-glusterfsProd-disperse-0: Mismatching 
xdata in answers of 'LOOKUP'
[2017-01-13 17:02:01.761405] I [dict.c:166:key_value_cmp] 
0-glusterfsProd-disperse-0: 'trusted.ec.size' is different in two dicts (8, 8)
[2017-01-13 17:02:01.761417] I [dict.c:3065:dict_dump_to_log] 
0-glusterfsProd-disperse-0: dict=0x7fa6706bbb14 
((trusted.ec.size:0:0:0:0:30:6b:0:0:)(trusted.ec.version:0:0:0:0:0:0:2a:38:0:0:0:0:0:0:2a:38:))
[2017-01-13 17:02:01.761428] I [dict.c:3065:dict_dump_to_log] 
0-glusterfsProd-disperse-0: dict=0x7fa6706bed64 
((trusted.ec.size:0:0:0:0:0:0:0:0:)(trusted.ec.version:0:0:0:0:0:0:0:0:0:0:0:0:0:0:2a:38:))
[2017-01-13 17:02:01.761433] W [MSGID: 122056] 
[ec-combine.c:881:ec_combine_check] 0-glusterfsProd-disperse-0: Mismatching 
xdata in answers of 'LOOKUP'
[2017-01-13 17:02:01.761442] W [MSGID: 122006] 
[ec-combine.c:214:ec_iatt_combine] 0-glusterfsProd-disperse-0: Failed to 
combine iatt (inode: 11275691004192850514-11275691004192850514, gfid: 
60b990ed-d741-4176-9c7b-4d3a25fb8252  -  60b990ed-d741-4176-9c7b-4d3a25fb8252,  
links: 1-1, uid: 0-0, gid: 0-0, rdev: 0-0,size: 406650880-406683648, mode: 
100775-100775)

The file for which we are seeing this error turns out to have a GFID of 
60b990ed-d741-4176-9c7b-4d3a25fb8252.

Then I tried to find the file with this GFID. It pointed me to the 
following path. I was expecting a real file system path, per the following 
tutorial:
https://gluster.readthedocs.io/en/latest/Troubleshooting/gfid-to-path/

getfattr -n trusted.glusterfs.pathinfo -e text 
/mnt/gfid/.gfid/60b990ed-d741-4176-9c7b-4d3a25fb8252
getfattr: Removing leading '/' from absolute path names
# file: mnt/gfid/.gfid/60b990ed-d741-4176-9c7b-4d3a25fb8252
trusted.glusterfs.pathinfo="( 
( 
<POSIX(/ws/disk1/ws_brick):glusterfs6:/ws/disk1/ws_brick/.glusterfs/60/b9/60b990ed-d741-4176-9c7b-4d3a25fb8252>
 
<POSIX(/ws/disk1/ws_brick):glusterfs5:/ws/disk1/ws_brick/.glusterfs/60/b9/60b990ed-d741-4176-9c7b-4d3a25fb8252>))"
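The backend paths returned here follow the .glusterfs layout, where each brick hard-links every inode under the first two bytes of its GFID. A small illustration of that layout (gfid_backend_path is a made-up helper, not a Gluster API):

```python
def gfid_backend_path(brick, gfid):
    # Bricks store every inode under .glusterfs/<aa>/<bb>/<full-gfid>,
    # where aa and bb are the first two bytes of the GFID.
    return "{}/.glusterfs/{}/{}/{}".format(brick, gfid[:2], gfid[2:4], gfid)

path = gfid_backend_path("/ws/disk1/ws_brick",
                         "60b990ed-d741-4176-9c7b-4d3a25fb8252")
print(path)
# /ws/disk1/ws_brick/.glusterfs/60/b9/60b990ed-d741-4176-9c7b-4d3a25fb8252
```

This matches the paths reported by trusted.glusterfs.pathinfo above, so the same construction can be used to run getfattr directly against each brick.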

Then I looked at the xattrs for these files on all 3 bricks:

[root@glusterfs4 glusterfs]# getfattr -d -m . -e hex 
/ws/disk1/ws_brick/.glusterfs/60/b9/60b990ed-d741-4176-9c7b-4d3a25fb8252
getfattr: Removing leading '/' from absolute path names
# file: ws/disk1/ws_brick/.glusterfs/60/b9/60b990ed-d741-4176-9c7b-4d3a25fb8252
trusted.bit-rot.version=0x02000000000000005877a8dc00041138
trusted.ec.config=0x0000080301000200
trusted.ec.size=0x0000000000000000
trusted.ec.version=0x00000000000000000000000000002a38
trusted.gfid=0x60b990edd74141769c7b4d3a25fb8252

[root@glusterfs5 bricks]# getfattr -d -m . -e hex 
/ws/disk1/ws_brick/.glusterfs/60/b9/60b990ed-d741-4176-9c7b-4d3a25fb8252
getfattr: Removing leading '/' from absolute path names
# file: ws/disk1/ws_brick/.glusterfs/60/b9/60b990ed-d741-4176-9c7b-4d3a25fb8252
trusted.bit-rot.version=0x02000000000000005877a8dc000c92d0
trusted.ec.config=0x0000080301000200
trusted.ec.dirty=0x0016
trusted.ec.size=0x00000000306b0000
trusted.ec.version=0x0000000000002a380000000000002a38
trusted.gfid=0x60b990edd74141769c7b4d3a25fb8252

[root@glusterfs6 ee]# getfattr -d -m . -e hex 
/ws/disk1/ws_brick/.glusterfs/60/b9/60b990ed-d741-4176-9c7b-4d3a25fb8252
getfattr: Removing leading '/' from absolute path names
# file: ws/disk1/ws_brick/.glusterfs/60/b9/60b990ed-d741-4176-9c7b-4d3a25fb8252
trusted.bit-rot.version=0x02000000000000005877a8dc000c9436
trusted.ec.config=0x0000080301000200
trusted.ec.dirty=0x0016
trusted.ec.size=0x00000000306b0000
trusted.ec.version=0x0000000000002a380000000000002a38
trusted.gfid=0x60b990edd74141769c7b4d3a25fb8252

It turns out that the size and version in fact do not match for one of the 
files.
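The mismatch can be confirmed directly from the dict_dump_to_log byte strings logged above for glusterfsProd-disperse-0. A rough sketch of the comparison ec_combine_check performs (decode is a hypothetical helper, and the dict values are taken verbatim from the log):

```python
def decode(dump):
    # Turn a dict_dump_to_log byte string into raw bytes.
    return bytes(int(b, 16) for b in dump.strip(":").split(":"))

good_brick = {
    "trusted.ec.size":    decode("0:0:0:0:30:6b:0:0"),
    "trusted.ec.version": decode("0:0:0:0:0:0:2a:38:0:0:0:0:0:0:2a:38"),
}
bad_brick = {
    "trusted.ec.size":    decode("0:0:0:0:0:0:0:0"),
    "trusted.ec.version": decode("0:0:0:0:0:0:0:0:0:0:0:0:0:0:2a:38"),
}

# Report every key whose value differs between the two answers.
mismatched = sorted(k for k in good_brick if good_brick[k] != bad_brick[k])
print(mismatched)
```

Both keys differ on the bad brick: its trusted.ec.size is 0, and the data half of its trusted.ec.version never advanced, while the metadata half matches.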

Thanks and Regards,
Ram

-Original Message-
From: gluster-devel-boun...@gluster.org 
[mailto:gluster-devel-boun...@gluster.org] On Behalf Of Ankireddypalle Reddy
Sent: Friday, January 13, 2017 4:17 AM
To: Xavier Hernandez
Cc: gluster-users@gluster.org; Gluster Devel (gluster-de...@gluster.org)
Subject: Re: [Gluster-devel] [Gluster-users] Lot of EIO errors in disperse 
volume

Xavi,
Thanks for explanation. Will collect TRACE logs today. 

Thanks and Regards,
Ram

Sent from my iPhone

> On Jan 13, 2017, at 3:03 AM, Xavier Hernandez <xhernan...@datalab.es> wro

Re: [Gluster-users] [Gluster-devel] Lot of EIO errors in disperse volume

2017-01-13 Thread Ankireddypalle Reddy
erfsProd-disperse-0: Operation 
>> failed on some subvolumes (up=7, mask=7, remaining=0, good=6, bad=1)
>> [2017-01-12 21:10:53.728876] W [MSGID: 122002] 
>> [ec-common.c:71:ec_heal_report] 0-glusterfsProd-disperse-0: Heal failed 
>> [Invalid argument]
> 
> This seems an attempt to heal a file, but I see a lot of differences between 
> both versions. The size on one brick is 13.238.272 bytes, but on the other 
> brick it's 1.111.097.344 bytes. That's a huge difference.
> 
> Looking at the trusted.ec.version, I see that the 'data' version is very 
> different (from 161 to 14.175), however the metadata version is exactly the 
> same. This really seems like a lot of writes while one brick was down (or 
> disconnected for some reason, or writes failed for some reason). One brick 
> has lost about 14.000 writes of ~80KB.
> 
> I think the most important thing right now would be to identify which files 
> and directories are having these problems to be able to identify the cause. 
> Again, the TRACE log will be really useful.
> 
> Xavi
> 
>> 
>> Thanks and Regards,
>> Ram
>> 
>> -Original Message-
>> From: Xavier Hernandez [mailto:xhernan...@datalab.es]
>> Sent: Thursday, January 12, 2017 6:40 AM
>> To: Ankireddypalle Reddy
>> Cc: Gluster Devel (gluster-de...@gluster.org); gluster-users@gluster.org
>> Subject: Re: [Gluster-users] [Gluster-devel] Lot of EIO errors in disperse 
>> volume
>> 
>> Hi Ram,
>> 
>> 
>>> On 12/01/17 11:49, Ankireddypalle Reddy wrote:
>>> Xavi,
>>>  As I mentioned before the error could happen for any FOP. Will try 
>>> to run with TRACE debug level. Is there a possibility that we are checking 
>>> for this attribute on a directory, because a directory does not seem to be 
>>> having this attribute set.
>> 
>> No, directories do not have this attribute and no one should be reading it 
>> from a directory.
>> 
>>> Also is the function to check size and version called after it is decided 
>>> that heal should be run or is this check is the one which decides whether a 
>>> heal should be run.
>> 
>> Almost all checks that trigger a heal are done in the lookup fop when some 
>> discrepancy is detected.
>> 
>> The function that checks size and version is called later once a lock on the 
>> inode is acquired (even if no heal is needed). However further failures in 
>> the processing of any fop can also trigger a self-heal.
>> 
>> Xavi
>> 
>>> 
>>> Thanks and Regards,
>>> Ram
>>> 
>>> Sent from my iPhone
>>> 

Re: [Gluster-users] [Gluster-devel] Lot of EIO errors in disperse volume

2017-01-13 Thread Xavier Hernandez
V_MAGNETIC/V_30970/CHUNK_390607
  trusted.ec.version=0x0000000000000009000000000000000b

The attribute value seems to be the same on all 3 bricks.


That's a clear indication that the ec warning is not related to this 
directory, because trusted.ec.version always increases, never decreases, 
and the directory has a value smaller than the one that appears in the 
log message.


If you show all dict entries in the log, it seems that it does refer to 
a directory because trusted.ec.size is not present, but it must be 
another directory than the one you looked at. We would need to find 
which one is having this issue. The TRACE log would be helpful here.





   Also please note that every single time trusted.ec.version was found 
to mismatch, the same values were logged. Following are 2 more instances of 
the trusted.ec.version mismatch.

[2017-01-12 20:14:25.554540] I [dict.c:166:key_value_cmp] 
0-glusterfsProd-disperse-2: 'trusted.ec.version' is different in two dicts (16, 
16)
[2017-01-12 20:14:25.554588] I [dict.c:3065:dict_dump_to_log] 
0-glusterfsProd-disperse-2: dict=0x7f0b6495a9f0 
((glusterfs.open-fd-count:30:0:)(trusted.glusterfs.dht:0:0:0:1:0:0:0:0:0:0:0:0:15:55:55:54:)(trusted.ec.version:0:0:0:0:0:0:0:d:0:0:0:0:0:0:0:e:))
[2017-01-12 20:14:25.554608] I [dict.c:3065:dict_dump_to_log] 
0-glusterfsProd-disperse-2: dict=0x7f0b6495903c 
((glusterfs.open-fd-count:30:0:)(trusted.glusterfs.dht:0:0:0:1:0:0:0:0:0:0:0:0:15:55:55:54:)(trusted.ec.version:0:0:0:0:0:0:0:b:0:0:0:0:0:0:0:e:))
[2017-01-12 20:14:25.554624] W [MSGID: 122053] 
[ec-common.c:116:ec_check_status] 0-glusterfsProd-disperse-2: Operation failed 
on some subvolumes (up=7, mask=7, remaining=0, good=3, bad=4)
[2017-01-12 20:14:25.554632] W [MSGID: 122002] [ec-common.c:71:ec_heal_report] 
0-glusterfsProd-disperse-2: Heal failed [Invalid argument]
[2017-01-12 20:14:25.98] I [dict.c:166:key_value_cmp] 
0-glusterfsProd-disperse-2: 'trusted.ec.version' is different in two dicts (16, 
16)
[2017-01-12 20:14:25.555622] I [dict.c:3065:dict_dump_to_log] 
0-glusterfsProd-disperse-2: dict=0x7f0b64956c24 
((glusterfs.open-fd-count:30:0:)(trusted.glusterfs.dht:0:0:0:1:0:0:0:0:0:0:0:0:15:55:55:54:)(trusted.ec.version:0:0:0:0:0:0:0:b:0:0:0:0:0:0:0:e:))
[2017-01-12 20:14:25.555638] I [dict.c:3065:dict_dump_to_log] 
0-glusterfsProd-disperse-2: dict=0x7f0b64964e8c 
((glusterfs.open-fd-count:30:0:)(trusted.glusterfs.dht:0:0:0:1:0:0:0:0:0:0:0:0:15:55:55:54:)(trusted.ec.version:0:0:0:0:0:0:0:d:0:0:0:0:0:0:0:e:))



I think that this refers to the same directory. This seems an attempt to 
heal it that has failed. So it makes sense that it finds exactly the 
same values.





In glustershd.log a lot of similar errors are logged.

[2017-01-12 21:10:53.728770] I [dict.c:166:key_value_cmp] 
0-glusterfsProd-disperse-0: 'trusted.ec.size' is different in two dicts (8, 8)
[2017-01-12 21:10:53.728804] I [dict.c:3065:dict_dump_to_log] 
0-glusterfsProd-disperse-0: dict=0x7f21694b6f50 
((trusted.ec.size:0:0:0:0:42:3a:0:0:)(trusted.ec.version:0:0:0:0:0:0:37:5f:0:0:0:0:0:0:37:5f:))
[2017-01-12 21:10:53.728827] I [dict.c:3065:dict_dump_to_log] 
0-glusterfsProd-disperse-0: dict=0x7f21694b62bc 
((trusted.ec.size:0:0:0:0:0:ca:0:0:)(trusted.ec.version:0:0:0:0:0:0:0:a1:0:0:0:0:0:0:37:5f:))
[2017-01-12 21:10:53.728842] W [MSGID: 122056] 
[ec-combine.c:873:ec_combine_check] 0-glusterfsProd-disperse-0: Mismatching 
xdata in answers of 'LOOKUP'
[2017-01-12 21:10:53.728854] W [MSGID: 122053] 
[ec-common.c:116:ec_check_status] 0-glusterfsProd-disperse-0: Operation failed 
on some subvolumes (up=7, mask=7, remaining=0, good=6, bad=1)
[2017-01-12 21:10:53.728876] W [MSGID: 122002] [ec-common.c:71:ec_heal_report] 
0-glusterfsProd-disperse-0: Heal failed [Invalid argument]


This seems an attempt to heal a file, but I see a lot of differences 
between both versions. The size on one brick is 13.238.272 bytes, but on 
the other brick it's 1.111.097.344 bytes. That's a huge difference.


Looking at the trusted.ec.version, I see that the 'data' version is very 
different (from 161 to 14.175), however the metadata version is exactly 
the same. This really seems like a lot of writes while one brick was 
down (or disconnected for some reason, or writes failed for some 
reason). One brick has lost about 14.000 writes of ~80KB.
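Those numbers can be sanity-checked from the logged xattrs: trusted.ec.size gives 0x423a0000 vs 0x00ca0000 bytes, and the data halves of trusted.ec.version are 0x375f vs 0xa1. Assuming each write bumps the data version once (a simplification of ec's actual versioning), a quick estimate:

```python
# Values decoded from the dict_dump_to_log lines above.
size_good, size_bad = 0x423a0000, 0x00ca0000  # 1,111,097,344 vs 13,238,272 bytes
ver_good, ver_bad = 0x375f, 0xa1              # data versions 14,175 vs 161

missed_writes = ver_good - ver_bad            # writes the bad brick never saw
avg_write = (size_good - size_bad) / missed_writes
print(missed_writes, round(avg_write))        # roughly 14,000 writes of ~78 KB
```

This matches the "about 14.000 writes of ~80KB" estimate above.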


I think the most important thing right now would be to identify which 
files and directories are having these problems to be able to identify 
the cause. Again, the TRACE log will be really useful.


Xavi



Thanks and Regards,
Ram

-Original Message-
From: Xavier Hernandez [mailto:xhernan...@datalab.es]
Sent: Thursday, January 12, 2017 6:40 AM
To: Ankireddypalle Reddy
Cc: Gluster Devel (gluster-de...@gluster.org); gluster-users@gluster.org
Subject: Re: [Gluster-users] [Gluster-devel] Lot of EIO errors in disperse 
volume

Hi Ram,


On 12/01/17 11:49, Ankireddypalle Reddy wrote:

Xavi,
  As I mentioned before the error could happen f

Re: [Gluster-users] [Gluster-devel] Lot of EIO errors in disperse volume

2017-01-12 Thread Ankireddypalle Reddy

Re: [Gluster-users] [Gluster-devel] Lot of EIO errors in disperse volume

2017-01-12 Thread Xavier Hernandez

Hi Ram,


On 12/01/17 11:49, Ankireddypalle Reddy wrote:

Xavi,
  As I mentioned before, the error could happen for any FOP. Will try to 
run with TRACE debug level. Is there a possibility that we are checking for 
this attribute on a directory? A directory does not seem to have this 
attribute set.


No, directories do not have this attribute and no one should be reading 
it from a directory.



Also, is the function that checks size and version called after it has already 
been decided that a heal should run, or is this check the one that decides 
whether a heal should run?


Almost all checks that trigger a heal are done in the lookup fop when 
some discrepancy is detected.


The function that checks size and version is called later once a lock on 
the inode is acquired (even if no heal is needed). However further 
failures in the processing of any fop can also trigger a self-heal.


Xavi



Thanks and Regards,
Ram

Sent from my iPhone


On Jan 12, 2017, at 2:25 AM, Xavier Hernandez <xhernan...@datalab.es> wrote:

Hi Ram,


On 12/01/17 02:36, Ankireddypalle Reddy wrote:
Xavi,
 I added some more logging information. The trusted.ec.size field 
values are in fact different.
  trusted.ec.size l1 = 62719407423488 l2 = 0


That's very weird. Directories do not have this attribute. It's only present on 
regular files. But you said that the error happens while creating the file, so 
it doesn't make much sense because file creation always sets trusted.ec.size to 
0.

Could you reproduce the problem with diagnostics.client-log-level set to TRACE 
and send the log to me ? it will create a big log, but I'll have much more 
information about what's going on.

Do you have a mixed setup with nodes of different types ? for example mixed 
32/64 bits architectures or different operating systems ? I ask this because 
62719407423488 in hex is 0x390B00000000, which has the lower 32 bits set to 0, 
but has garbage above that.
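Xavier's observation about the corrupted value can be checked directly:

```python
# The bogus trusted.ec.size value from the log above.
v = 62719407423488
print(hex(v))                  # 0x390b00000000
assert v & 0xFFFFFFFF == 0     # the lower 32 bits are all zero
assert v >> 32 == 0x390b       # the non-zero "garbage" sits above bit 31
```

A valid trusted.ec.size of 0 with garbage only in the upper half is what one would expect from an uninitialized or mixed-width 32/64-bit issue, which is why the question about mixed architectures is relevant.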



   This is a fairly static setup with no brick or node failures. Please 
explain why a heal is being triggered and what could have actually caused 
these size xattrs to differ. This is causing random I/O failures and is 
impacting the backup schedules.


The launch of self-heal is normal because it has detected an inconsistency. The 
real problem is what originates that inconsistency.

Xavi



[ 2017-01-12 01:19:18.256970] W [MSGID: 122056] 
[ec-combine.c:873:ec_combine_check] 0-glusterfsProd-disperse-8: Mismatching 
xdata in answers of 'LOOKUP'
[2017-01-12 01:19:18.257015] W [MSGID: 122053] 
[ec-common.c:116:ec_check_status] 0-glusterfsProd-disperse-8: Operation failed 
on some subvolumes (up=7, mask=7, remaining=0, good=3, bad=4)
[2017-01-12 01:19:18.257018] W [MSGID: 122002] [ec-common.c:71:ec_heal_report] 
0-glusterfsProd-disperse-8: Heal failed [Invalid argument]
[2017-01-12 01:19:21.002028] E [dict.c:197:key_value_cmp] 
0-glusterfsProd-disperse-4: 'trusted.ec.size' is different in two dicts (8, 8)
[2017-01-12 01:19:21.002056] E [dict.c:166:log_value] 
0-glusterfsProd-disperse-4: trusted.ec.size [ l1 = 62719407423488 l2 = 0 i1 = 0 
i2 = 0 ]
[2017-01-12 01:19:21.002064] W [MSGID: 122056] 
[ec-combine.c:873:ec_combine_check] 0-glusterfsProd-disperse-4: Mismatching 
xdata in answers of 'LOOKUP'
[2017-01-12 01:19:21.209640] E [dict.c:197:key_value_cmp] 
0-glusterfsProd-disperse-4: 'trusted.ec.size' is different in two dicts (8, 8)
[2017-01-12 01:19:21.209673] E [dict.c:166:log_value] 
0-glusterfsProd-disperse-4: trusted.ec.size [ l1 = 62719407423488 l2 = 0 i1 = 0 
i2 = 0 ]
[2017-01-12 01:19:21.209686] W [MSGID: 122056] 
[ec-combine.c:873:ec_combine_check] 0-glusterfsProd-disperse-4: Mismatching 
xdata in answers of 'LOOKUP'
[2017-01-12 01:19:21.209719] W [MSGID: 122053] 
[ec-common.c:116:ec_check_status] 0-glusterfsProd-disperse-4: Operation failed 
on some subvolumes (up=7, mask=7, remaining=0, good=6, bad=1)
[2017-01-12 01:19:21.209753] W [MSGID: 122002] [ec-common.c:71:ec_heal_report] 
0-glusterfsProd-disperse-4: Heal failed [Invalid argument]

Thanks and Regards,
Ram

-Original Message-
From: Ankireddypalle Reddy
Sent: Wednesday, January 11, 2017 9:29 AM
To: Ankireddypalle Reddy; Xavier Hernandez; Gluster Devel 
(gluster-de...@gluster.org); gluster-users@gluster.org
Subject: RE: [Gluster-users] [Gluster-devel] Lot of EIO errors in disperse 
volume

Xavi,
   I built a debug binary to log more information. This is what is 
getting logged. Looks like it is the attribute trusted.ec.size which is 
different among the bricks in a sub volume.

In glustershd.log :

[2017-01-11 14:19:45.023845] N [MSGID: 122029] 
[ec-generic.c:683:ec_combine_lookup] 0-glusterfsProd-disperse-8: Mismatching 
iatt in answers of 'GF_FOP_LOOKUP'
[2017-01-11 14:19:45.027718] E [dict.c:166:key_value_cmp] 
0-glusterfsProd-disperse-6: 'trusted.ec.size' is different in two dicts (8, 8)
[2017-01-11 14:19:45.027736] W [MSGID: 122056] 
[ec-combine.c:873:ec_combine

Re: [Gluster-users] [Gluster-devel] Lot of EIO errors in disperse volume

2017-01-12 Thread Ankireddypalle Reddy
Xavi,
  As I mentioned before the error could happen for any FOP. Will try to 
run with TRACE debug level. Is there a possibility that we are checking for 
this attribute on a directory, because a directory does not seem to be having 
this attribute set. Also is the function to check size and version called after 
it is decided that heal should be run or is this check is the one which decides 
whether a heal should be run.

Thanks and Regards,
Ram

Sent from my iPhone

> On Jan 12, 2017, at 2:25 AM, Xavier Hernandez <xhernan...@datalab.es> wrote:
> 
> Hi Ram,
> 
>> On 12/01/17 02:36, Ankireddypalle Reddy wrote:
>> Xavi,
>>  I added some more logging information. The trusted.ec.size field 
>> values are in fact different.
>>   trusted.ec.sizel1 = 62719407423488l2 = 0
> 
> That's very weird. Directories do not have this attribute. It's only present 
> on regular files. But you said that the error happens while creating the 
> file, so it doesn't make much sense because file creation always sets 
> trusted.ec.size to 0.
> 
> Could you reproduce the problem with diagnostics.client-log-level set to 
> TRACE and send the log to me ? it will create a big log, but I'll have much 
> more information about what's going on.
> 
> Do you have a mixed setup with nodes of different types ? for example mixed 
> 32/64 bits architectures or different operating systems ? I ask this because 
> 62719407423488 in hex is 0x390B, which has the lower 32 bits set to 
> 0, but has garbage above that.
> 
>> 
>>   This is a fairly static setup with no brick/node failure.  Please 
>> explain why a heal is being triggered and what could actually have 
>> caused these size xattrs to differ.  This is causing random I/O failures and 
>> is impacting the backup schedules.
> 
> The launch of self-heal is normal because it has detected an inconsistency. 
> The real problem is what originates that inconsistency.
> 
> Xavi
> 
>> 
>> [ 2017-01-12 01:19:18.256970] W [MSGID: 122056] 
>> [ec-combine.c:873:ec_combine_check] 0-glusterfsProd-disperse-8: Mismatching 
>> xdata in answers of 'LOOKUP'
>> [2017-01-12 01:19:18.257015] W [MSGID: 122053] 
>> [ec-common.c:116:ec_check_status] 0-glusterfsProd-disperse-8: Operation 
>> failed on some subvolumes (up=7, mask=7, remaining=0, good=3, bad=4)
>> [2017-01-12 01:19:18.257018] W [MSGID: 122002] 
>> [ec-common.c:71:ec_heal_report] 0-glusterfsProd-disperse-8: Heal failed 
>> [Invalid argument]
>> [2017-01-12 01:19:21.002028] E [dict.c:197:key_value_cmp] 
>> 0-glusterfsProd-disperse-4: 'trusted.ec.size' is different in two dicts (8, 
>> 8)
>> [2017-01-12 01:19:21.002056] E [dict.c:166:log_value] 
>> 0-glusterfsProd-disperse-4: trusted.ec.size [ l1 = 62719407423488 l2 = 0 i1 
>> = 0 i2 = 0 ]
>> [2017-01-12 01:19:21.002064] W [MSGID: 122056] 
>> [ec-combine.c:873:ec_combine_check] 0-glusterfsProd-disperse-4: Mismatching 
>> xdata in answers of 'LOOKUP'
>> [2017-01-12 01:19:21.209640] E [dict.c:197:key_value_cmp] 
>> 0-glusterfsProd-disperse-4: 'trusted.ec.size' is different in two dicts (8, 
>> 8)
>> [2017-01-12 01:19:21.209673] E [dict.c:166:log_value] 
>> 0-glusterfsProd-disperse-4: trusted.ec.size [ l1 = 62719407423488 l2 = 0 i1 
>> = 0 i2 = 0 ]
>> [2017-01-12 01:19:21.209686] W [MSGID: 122056] 
>> [ec-combine.c:873:ec_combine_check] 0-glusterfsProd-disperse-4: Mismatching 
>> xdata in answers of 'LOOKUP'
>> [2017-01-12 01:19:21.209719] W [MSGID: 122053] 
>> [ec-common.c:116:ec_check_status] 0-glusterfsProd-disperse-4: Operation 
>> failed on some subvolumes (up=7, mask=7, remaining=0, good=6, bad=1)
>> [2017-01-12 01:19:21.209753] W [MSGID: 122002] 
>> [ec-common.c:71:ec_heal_report] 0-glusterfsProd-disperse-4: Heal failed 
>> [Invalid argument]
>> 
>> Thanks and Regards,
>> Ram
>> 
>> -Original Message-
>> From: Ankireddypalle Reddy
>> Sent: Wednesday, January 11, 2017 9:29 AM
>> To: Ankireddypalle Reddy; Xavier Hernandez; Gluster Devel 
>> (gluster-de...@gluster.org); gluster-users@gluster.org
>> Subject: RE: [Gluster-users] [Gluster-devel] Lot of EIO errors in disperse 
>> volume
>> 
>> Xavi,
>>I built a debug binary to log more information. This is what is 
>> getting logged. Looks like it is the attribute trusted.ec.size which is 
>> different among the bricks in a sub volume.
>> 
>> In glustershd.log :
>> 
>> [2017-01-11 14:19:45.023845] N [MSGID: 122029] 
>> [ec-generic.c:683:ec_combine_lookup] 0-glusterfsProd-disperse-8: Mismatching 
>> iatt in 

Re: [Gluster-users] [Gluster-devel] Lot of EIO errors in disperse volume

2017-01-11 Thread Xavier Hernandez

Hi Ram,

On 12/01/17 02:36, Ankireddypalle Reddy wrote:

Xavi,
  I added some more logging information. The trusted.ec.size field 
values are in fact different.
   trusted.ec.size  l1 = 62719407423488  l2 = 0


That's very weird. Directories do not have this attribute. It's only 
present on regular files. But you said that the error happens while 
creating the file, so it doesn't make much sense because file creation 
always sets trusted.ec.size to 0.


Could you reproduce the problem with diagnostics.client-log-level set to 
TRACE and send the log to me ? it will create a big log, but I'll have 
much more information about what's going on.


Do you have a mixed setup with nodes of different types ? for example 
mixed 32/64 bits architectures or different operating systems ? I ask 
this because 62719407423488 in hex is 0x390B00000000, which has the 
lower 32 bits set to 0, but has garbage above that.




   This is a fairly static setup with no brick/node failure.  Please 
explain why a heal is being triggered and what could actually have caused 
these size xattrs to differ.  This is causing random I/O failures and is 
impacting the backup schedules.


The launch of self-heal is normal because it has detected an 
inconsistency. The real problem is what originates that inconsistency.


Xavi
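
The good/bad values in the ec_check_status log lines below (e.g. "good=6, bad=1" with "up=7") are bitmaps of brick positions within the disperse set. A rough, hypothetical illustration of how per-brick answers could be reduced to such masks (this is a sketch of the idea, not the actual ec_combine/ec_check_status implementation):

```python
from collections import Counter

def check_answers(answers):
    """answers: list of per-brick xattr dicts, list index = brick position."""
    key = "trusted.ec.size"
    # Majority value wins; bricks that disagree get flagged for heal.
    majority, _ = Counter(a[key] for a in answers).most_common(1)[0]
    good = bad = 0
    for i, a in enumerate(answers):
        if a[key] == majority:
            good |= 1 << i
        else:
            bad |= 1 << i
    return good, bad

# Three bricks; brick 0 has a bogus size.
good, bad = check_answers([{"trusted.ec.size": 62719407423488},
                           {"trusted.ec.size": 0},
                           {"trusted.ec.size": 0}])
# -> good=6 (bricks 1 and 2, 0b110), bad=1 (brick 0, 0b001)
```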



[ 2017-01-12 01:19:18.256970] W [MSGID: 122056] 
[ec-combine.c:873:ec_combine_check] 0-glusterfsProd-disperse-8: Mismatching 
xdata in answers of 'LOOKUP'
[2017-01-12 01:19:18.257015] W [MSGID: 122053] 
[ec-common.c:116:ec_check_status] 0-glusterfsProd-disperse-8: Operation failed 
on some subvolumes (up=7, mask=7, remaining=0, good=3, bad=4)
[2017-01-12 01:19:18.257018] W [MSGID: 122002] [ec-common.c:71:ec_heal_report] 
0-glusterfsProd-disperse-8: Heal failed [Invalid argument]
[2017-01-12 01:19:21.002028] E [dict.c:197:key_value_cmp] 
0-glusterfsProd-disperse-4: 'trusted.ec.size' is different in two dicts (8, 8)
[2017-01-12 01:19:21.002056] E [dict.c:166:log_value] 
0-glusterfsProd-disperse-4: trusted.ec.size [ l1 = 62719407423488 l2 = 0 i1 = 0 
i2 = 0 ]
[2017-01-12 01:19:21.002064] W [MSGID: 122056] 
[ec-combine.c:873:ec_combine_check] 0-glusterfsProd-disperse-4: Mismatching 
xdata in answers of 'LOOKUP'
[2017-01-12 01:19:21.209640] E [dict.c:197:key_value_cmp] 
0-glusterfsProd-disperse-4: 'trusted.ec.size' is different in two dicts (8, 8)
[2017-01-12 01:19:21.209673] E [dict.c:166:log_value] 
0-glusterfsProd-disperse-4: trusted.ec.size [ l1 = 62719407423488 l2 = 0 i1 = 0 
i2 = 0 ]
[2017-01-12 01:19:21.209686] W [MSGID: 122056] 
[ec-combine.c:873:ec_combine_check] 0-glusterfsProd-disperse-4: Mismatching 
xdata in answers of 'LOOKUP'
[2017-01-12 01:19:21.209719] W [MSGID: 122053] 
[ec-common.c:116:ec_check_status] 0-glusterfsProd-disperse-4: Operation failed 
on some subvolumes (up=7, mask=7, remaining=0, good=6, bad=1)
[2017-01-12 01:19:21.209753] W [MSGID: 122002] [ec-common.c:71:ec_heal_report] 
0-glusterfsProd-disperse-4: Heal failed [Invalid argument]

Thanks and Regards,
Ram

-Original Message-
From: Ankireddypalle Reddy
Sent: Wednesday, January 11, 2017 9:29 AM
To: Ankireddypalle Reddy; Xavier Hernandez; Gluster Devel 
(gluster-de...@gluster.org); gluster-users@gluster.org
Subject: RE: [Gluster-users] [Gluster-devel] Lot of EIO errors in disperse 
volume

Xavi,
I built a debug binary to log more information. This is what is 
getting logged. Looks like it is the attribute trusted.ec.size which is 
different among the bricks in a sub volume.

In glustershd.log :

[2017-01-11 14:19:45.023845] N [MSGID: 122029] 
[ec-generic.c:683:ec_combine_lookup] 0-glusterfsProd-disperse-8: Mismatching 
iatt in answers of 'GF_FOP_LOOKUP'
[2017-01-11 14:19:45.027718] E [dict.c:166:key_value_cmp] 
0-glusterfsProd-disperse-6: 'trusted.ec.size' is different in two dicts (8, 8)
[2017-01-11 14:19:45.027736] W [MSGID: 122056] 
[ec-combine.c:873:ec_combine_check] 0-glusterfsProd-disperse-6: Mismatching 
xdata in answers of 'LOOKUP'
[2017-01-11 14:19:45.027763] E [dict.c:166:key_value_cmp] 
0-glusterfsProd-disperse-6: 'trusted.ec.size' is different in two dicts (8, 8)
[2017-01-11 14:19:45.027781] W [MSGID: 122056] 
[ec-combine.c:873:ec_combine_check] 0-glusterfsProd-disperse-6: Mismatching 
xdata in answers of 'LOOKUP'
[2017-01-11 14:19:45.027793] W [MSGID: 122053] 
[ec-common.c:116:ec_check_status] 0-glusterfsProd-disperse-6: Operation failed 
on some subvolumes (up=7, mask=7, remaining=0, good=6, bad=1)
[2017-01-11 14:19:45.027815] W [MSGID: 122002] [ec-common.c:71:ec_heal_report] 
0-glusterfsProd-disperse-6: Heal failed [Invalid argument]
[2017-01-11 14:19:45.029035] E [dict.c:166:key_value_cmp] 
0-glusterfsProd-disperse-8: 'trusted.ec.size' is different in two dicts (8, 8)
[2017-01-11 14:19:45.029057] W [MSGID: 122056] 
[ec-combine.c:873:ec_combine_check] 0-glusterfsProd-disperse-8: Mismatching 
xdata in answers of 'LOOKUP'
[2017-01-11 14:19:45.029089] E [dict.c:166

Re: [Gluster-users] [Gluster-devel] Lot of EIO errors in disperse volume

2017-01-11 Thread Ankireddypalle Reddy
Xavi,
  I added some more logging information. The trusted.ec.size field 
values are in fact different.  
   trusted.ec.size  l1 = 62719407423488  l2 = 0
   
   This is a fairly static setup with no brick/node failure.  Please 
explain why a heal is being triggered and what could actually have caused 
these size xattrs to differ.  This is causing random I/O failures and is 
impacting the backup schedules.

[ 2017-01-12 01:19:18.256970] W [MSGID: 122056] 
[ec-combine.c:873:ec_combine_check] 0-glusterfsProd-disperse-8: Mismatching 
xdata in answers of 'LOOKUP'
[2017-01-12 01:19:18.257015] W [MSGID: 122053] 
[ec-common.c:116:ec_check_status] 0-glusterfsProd-disperse-8: Operation failed 
on some subvolumes (up=7, mask=7, remaining=0, good=3, bad=4)
[2017-01-12 01:19:18.257018] W [MSGID: 122002] [ec-common.c:71:ec_heal_report] 
0-glusterfsProd-disperse-8: Heal failed [Invalid argument]
[2017-01-12 01:19:21.002028] E [dict.c:197:key_value_cmp] 
0-glusterfsProd-disperse-4: 'trusted.ec.size' is different in two dicts (8, 8)
[2017-01-12 01:19:21.002056] E [dict.c:166:log_value] 
0-glusterfsProd-disperse-4: trusted.ec.size [ l1 = 62719407423488 l2 = 0 i1 = 0 
i2 = 0 ]
[2017-01-12 01:19:21.002064] W [MSGID: 122056] 
[ec-combine.c:873:ec_combine_check] 0-glusterfsProd-disperse-4: Mismatching 
xdata in answers of 'LOOKUP'
[2017-01-12 01:19:21.209640] E [dict.c:197:key_value_cmp] 
0-glusterfsProd-disperse-4: 'trusted.ec.size' is different in two dicts (8, 8)
[2017-01-12 01:19:21.209673] E [dict.c:166:log_value] 
0-glusterfsProd-disperse-4: trusted.ec.size [ l1 = 62719407423488 l2 = 0 i1 = 0 
i2 = 0 ]
[2017-01-12 01:19:21.209686] W [MSGID: 122056] 
[ec-combine.c:873:ec_combine_check] 0-glusterfsProd-disperse-4: Mismatching 
xdata in answers of 'LOOKUP'
[2017-01-12 01:19:21.209719] W [MSGID: 122053] 
[ec-common.c:116:ec_check_status] 0-glusterfsProd-disperse-4: Operation failed 
on some subvolumes (up=7, mask=7, remaining=0, good=6, bad=1)
[2017-01-12 01:19:21.209753] W [MSGID: 122002] [ec-common.c:71:ec_heal_report] 
0-glusterfsProd-disperse-4: Heal failed [Invalid argument]

Thanks and Regards,
Ram

-Original Message-
From: Ankireddypalle Reddy 
Sent: Wednesday, January 11, 2017 9:29 AM
To: Ankireddypalle Reddy; Xavier Hernandez; Gluster Devel 
(gluster-de...@gluster.org); gluster-users@gluster.org
Subject: RE: [Gluster-users] [Gluster-devel] Lot of EIO errors in disperse 
volume

Xavi,
I built a debug binary to log more information. This is what is 
getting logged. Looks like it is the attribute trusted.ec.size which is 
different among the bricks in a sub volume. 

In glustershd.log :

[2017-01-11 14:19:45.023845] N [MSGID: 122029] 
[ec-generic.c:683:ec_combine_lookup] 0-glusterfsProd-disperse-8: Mismatching 
iatt in answers of 'GF_FOP_LOOKUP'
[2017-01-11 14:19:45.027718] E [dict.c:166:key_value_cmp] 
0-glusterfsProd-disperse-6: 'trusted.ec.size' is different in two dicts (8, 8)
[2017-01-11 14:19:45.027736] W [MSGID: 122056] 
[ec-combine.c:873:ec_combine_check] 0-glusterfsProd-disperse-6: Mismatching 
xdata in answers of 'LOOKUP'
[2017-01-11 14:19:45.027763] E [dict.c:166:key_value_cmp] 
0-glusterfsProd-disperse-6: 'trusted.ec.size' is different in two dicts (8, 8)
[2017-01-11 14:19:45.027781] W [MSGID: 122056] 
[ec-combine.c:873:ec_combine_check] 0-glusterfsProd-disperse-6: Mismatching 
xdata in answers of 'LOOKUP'
[2017-01-11 14:19:45.027793] W [MSGID: 122053] 
[ec-common.c:116:ec_check_status] 0-glusterfsProd-disperse-6: Operation failed 
on some subvolumes (up=7, mask=7, remaining=0, good=6, bad=1)
[2017-01-11 14:19:45.027815] W [MSGID: 122002] [ec-common.c:71:ec_heal_report] 
0-glusterfsProd-disperse-6: Heal failed [Invalid argument]
[2017-01-11 14:19:45.029035] E [dict.c:166:key_value_cmp] 
0-glusterfsProd-disperse-8: 'trusted.ec.size' is different in two dicts (8, 8)
[2017-01-11 14:19:45.029057] W [MSGID: 122056] 
[ec-combine.c:873:ec_combine_check] 0-glusterfsProd-disperse-8: Mismatching 
xdata in answers of 'LOOKUP'
[2017-01-11 14:19:45.029089] E [dict.c:166:key_value_cmp] 
0-glusterfsProd-disperse-8: 'trusted.ec.size' is different in two dicts (8, 8)
[2017-01-11 14:19:45.029105] W [MSGID: 122056] 
[ec-combine.c:873:ec_combine_check] 0-glusterfsProd-disperse-8: Mismatching 
xdata in answers of 'LOOKUP'
[2017-01-11 14:19:45.029121] W [MSGID: 122053] 
[ec-common.c:116:ec_check_status] 0-glusterfsProd-disperse-8: Operation failed 
on some subvolumes (up=7, mask=7, remaining=0, good=6, bad=1)
[2017-01-11 14:19:45.032566] E [dict.c:166:key_value_cmp] 
0-glusterfsProd-disperse-6: 'trusted.ec.size' is different in two dicts (8, 8)
[2017-01-11 14:19:45.029138] W [MSGID: 122002] [ec-common.c:71:ec_heal_report] 
0-glusterfsProd-disperse-8: Heal failed [Invalid argument]
[2017-01-11 14:19:45.032585] W [MSGID: 122056] 
[ec-combine.c:873:ec_combine_check] 0-glusterfsProd-disperse-6: Mismatching 
xdata in answers of 'LOOKUP'
[2017-01-11 14:19:45.032614] E

Re: [Gluster-users] [Gluster-devel] Lot of EIO errors in disperse volume

2017-01-11 Thread Ankireddypalle Reddy
] E [MSGID: 122001] 
[ec-common.c:872:ec_config_check] 2-glusterfsProd-disperse-10: Invalid or 
corrupted config [Invalid argument]
[2017-01-11 14:20:17.807298] E [MSGID: 122066] 
[ec-common.c:969:ec_prepare_update_cbk] 2-glusterfsProd-disperse-10: Invalid 
config xattr [Invalid argument]
[2017-01-11 14:20:17.807409] E [MSGID: 122001] 
[ec-common.c:872:ec_config_check] 2-glusterfsProd-disperse-11: Invalid or 
corrupted config [Invalid argument]
[2017-01-11 14:20:17.807420] E [MSGID: 122066] 
[ec-common.c:969:ec_prepare_update_cbk] 2-glusterfsProd-disperse-11: Invalid 
config xattr [Invalid argument]
[2017-01-11 14:20:17.807448] E [MSGID: 122001] 
[ec-common.c:872:ec_config_check] 2-glusterfsProd-disperse-4: Invalid or 
corrupted config [Invalid argument]
[2017-01-11 14:20:17.807462] E [MSGID: 122066] 
[ec-common.c:969:ec_prepare_update_cbk] 2-glusterfsProd-disperse-4: Invalid 
config xattr [Invalid argument]
[2017-01-11 14:20:17.807539] E [MSGID: 122001] 
[ec-common.c:872:ec_config_check] 2-glusterfsProd-disperse-2: Invalid or 
corrupted config [Invalid argument]
[2017-01-11 14:20:17.807550] E [MSGID: 122066] 
[ec-common.c:969:ec_prepare_update_cbk] 2-glusterfsProd-disperse-2: Invalid 
config xattr [Invalid argument]
[2017-01-11 14:20:17.807723] E [MSGID: 122001] 
[ec-common.c:872:ec_config_check] 2-glusterfsProd-disperse-3: Invalid or 
corrupted config [Invalid argument]
[2017-01-11 14:20:17.807739] E [MSGID: 122066] 
[ec-common.c:969:ec_prepare_update_cbk] 2-glusterfsProd-disperse-3: Invalid 
config xattr [Invalid argument]
[2017-01-11 14:20:17.807785] E [MSGID: 122001] 
[ec-common.c:872:ec_config_check] 2-glusterfsProd-disperse-5: Invalid or 
corrupted config [Invalid argument]
[2017-01-11 14:20:17.807796] E [MSGID: 122066] 
[ec-common.c:969:ec_prepare_update_cbk] 2-glusterfsProd-disperse-5: Invalid 
config xattr [Invalid argument]
[2017-01-11 14:20:17.808020] E [MSGID: 122001] 
[ec-common.c:872:ec_config_check] 2-glusterfsProd-disperse-9: Invalid or 
corrupted config [Invalid argument]
[2017-01-11 14:20:17.808034] E [MSGID: 122066] 
[ec-common.c:969:ec_prepare_update_cbk] 2-glusterfsProd-disperse-9: Invalid 
config xattr [Invalid argument]
[2017-01-11 14:20:17.808054] E [MSGID: 122001] 
[ec-common.c:872:ec_config_check] 2-glusterfsProd-disperse-6: Invalid or 
corrupted config [Invalid argument]
[2017-01-11 14:20:17.808066] E [MSGID: 122066] 
[ec-common.c:969:ec_prepare_update_cbk] 2-glusterfsProd-disperse-6: Invalid 
config xattr [Invalid argument]
[2017-01-11 14:20:17.808282] E [MSGID: 122001] 
[ec-common.c:872:ec_config_check] 2-glusterfsProd-disperse-8: Invalid or 
corrupted config [Invalid argument]
[2017-01-11 14:20:17.808292] E [MSGID: 122066] 
[ec-common.c:969:ec_prepare_update_cbk] 2-glusterfsProd-disperse-8: Invalid 
config xattr [Invalid argument]
[2017-01-11 14:20:17.809212] E [MSGID: 122001] 
[ec-common.c:872:ec_config_check] 2-glusterfsProd-disperse-7: Invalid or 
corrupted config [Invalid argument]
[2017-01-11 14:20:17.809228] E [MSGID: 122066] 
[ec-common.c:969:ec_prepare_update_cbk] 2-glusterfsProd-disperse-7: Invalid 
config xattr [Invalid argument]

 [2017-01-11 14:20:17.812660] I [MSGID: 109036] 
[dht-common.c:8043:dht_log_new_layout_for_dir_selfheal] 2-glusterfsProd-dht: 
Setting layout of /Folder_01.05.2017_21.15/CV_MAGNETIC/V_31500/CHUNK_402578 
with [Subvol_name: glusterfsProd-disperse-0, Err: -1 , Start: 1789569705 , 
Stop: 2147483645 , Hash: 1 ], [Subvol_name: glusterfsProd-disperse-1, Err: -1 , 
Start: 2147483646 , Stop: 2505397586 , Hash: 1 ], [Subvol_name: 
glusterfsProd-disperse-10, Err: -1 , Start: 2505397587 , Stop: 2863311527 , 
Hash: 1 ], [Subvol_name: glusterfsProd-disperse-11, Err: -1 , Start: 2863311528 
, Stop: 3221225468 , Hash: 1 ], [Subvol_name: glusterfsProd-disperse-2, Err: -1 
, Start: 3221225469 , Stop: 3579139409 , Hash: 1 ], [Subvol_name: 
glusterfsProd-disperse-3, Err: -1 , Start: 3579139410 , Stop: 3937053350 , 
Hash: 1 ], [Subvol_name: glusterfsProd-disperse-4, Err: -1 , Start: 3937053351 
, Stop: 4294967295 , Hash: 1 ], [Subvol_name: glusterfsProd-disperse-5, Err: -1 
, Start: 0 , Stop: 357913940 , Hash: 1 ], [Subvol_name: glusterfsProd-disperse-6, Err: -1 , Start: 357913941 , 
Stop: 715827881 , Hash: 1 ], [Subvol_name: glusterfsProd-disperse-7, Err: -1 , 
Start: 715827882 , Stop: 1073741822 , Hash: 1 ], [Subvol_name: 
glusterfsProd-disperse-8, Err: -1 , Start: 1073741823 , Stop: 1431655763 , 
Hash: 1 ], [Subvol_name: glusterfsProd-disperse-9, Err: -1 , Start: 1431655764 
, Stop: 1789569704 , Hash: 1 ],
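
The layout log above shows DHT carving the 32-bit hash space into 12 equal ranges, one per disperse subvolume. A simplified sketch of that arithmetic (illustrative only; real DHT layouts also rotate the starting offset, which is why the log's ranges do not begin at 0 for subvolume 0):

```python
SUBVOLS = 12
SPACE = 2 ** 32
step = SPACE // SUBVOLS            # 357913941, matching Stop-Start+1 in the log

ranges = []
start = 0
for i in range(SUBVOLS):
    # Last range absorbs the division remainder so the space is fully covered.
    stop = SPACE - 1 if i == SUBVOLS - 1 else start + step - 1
    ranges.append((start, stop))
    start = stop + 1

# ranges[0] == (0, 357913940), like disperse-5's range in the log above.
```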


-Original Message-
From: gluster-users-boun...@gluster.org 
[mailto:gluster-users-boun...@gluster.org] On Behalf Of Ankireddypalle Reddy
Sent: Tuesday, January 10, 2017 10:09 AM
To: Xavier Hernandez; Gluster Devel (gluster-de...@gluster.org); 
gluster-users@gluster.org
Subject: Re: [Gluster-users] [Gluster-devel] Lot of EIO errors in disperse 
volume

Xavi,
   In this case it's the file creation which failed. So I provided the 
xattrs

Re: [Gluster-users] [Gluster-devel] Lot of EIO errors in disperse volume

2017-01-10 Thread Xavier Hernandez

Hi Ram,

On 10/01/17 14:42, Ankireddypalle Reddy wrote:

Attachments (2): ec.txt (11.50 KB), ws-glus.log (3.48 MB)

Xavi,
  We are encountering errors for different kinds of FOPS.
  The open failed for the following file:

  cvd_2017_01_10_02_28_26.log:98182 1f9fe 01/10 00:57:10 8414465
[MEDIAFS] 20117519-52075477 SingleInstancer_FS::StartDataFile2:
Failed to create the data file
[/ws/glus/Folder_07.11.2016_23.02/CV_MAGNETIC/V_8854974/CHUNK_51342720/SFILE_CONTAINER_062],
error=0xECCC0005:{CQiFile::Open(92)} +
{CQiUTFOSAPI::open(96)/ErrNo.5.(Input/output error)-Open failed,
File=/ws/glus/Folder_07.11.2016_23.02/CV_MAGNETIC/V_8854974/CHUNK_51342720/SFILE_CONTAINER_062,
OperationFlag=0xC1, PermissionMode=0x1FF}

  I've attached the extended attributes for the directories
  /ws/glus/Folder_07.11.2016_23.02/CV_MAGNETIC/V_8854974/ and

/ws/glus/Folder_07.11.2016_23.02/CV_MAGNETIC/V_8854974/CHUNK_51342720
from all the bricks.

 The attributes look fine to me. I've also attached some log
cuts to illustrate the problem.


I need the extended attributes of the file itself, not the parent 
directories.


Xavi



Thanks and Regards,
Ram

-Original Message-
From: Xavier Hernandez [mailto:xhernan...@datalab.es]
Sent: Tuesday, January 10, 2017 7:53 AM
To: Ankireddypalle Reddy; Gluster Devel (gluster-de...@gluster.org);
gluster-users@gluster.org
Subject: Re: [Gluster-devel] Lot of EIO errors in disperse volume

Hi Ram,

the error is caused by an extended attribute that does not match on all
3 bricks of the disperse set. Most probable value is trusted.ec.version,
but could be others.

At first sight, I don't see any change from 3.7.8 that could have caused
this. I'll check again.

What kind of operations are you doing ? this can help me narrow the search.

Xavi

On 10/01/17 13:43, Ankireddypalle Reddy wrote:

Xavi,
  Thanks. If you could please explain what to look for in the

extended attributes then I will check and let you know if I find
anything suspicious.  Also we noticed that some of these operations
would succeed if retried. Do you know of any communication-related errors
that are being reported/triaged?


Thanks and Regards,
Ram

-Original Message-
From: Xavier Hernandez [mailto:xhernan...@datalab.es]
Sent: Tuesday, January 10, 2017 7:23 AM
To: Ankireddypalle Reddy; Gluster Devel (gluster-de...@gluster.org);
gluster-users@gluster.org
Subject: Re: [Gluster-devel] Lot of EIO errors in disperse volume

Hi Ram,

On 10/01/17 13:14, Ankireddypalle Reddy wrote:

Attachment (1): ecxattrs.txt (5.92 KB)

Xavi,
 Please find attached the extended attributes for a
directory from all the bricks. Free space check failed for this with
error number EIO.


What do you mean ? what operation have you made to check the free

space on that directory ?


If it's a recursive check, I need the extended attributes from the

exact file that triggers the EIO. The attached attributes seem
consistent and that directory shouldn't cause any problem. Does an 'ls'
on that directory fail or does it show the contents ?


Xavi



Thanks and Regards,
Ram

-Original Message-
From: Xavier Hernandez [mailto:xhernan...@datalab.es]
Sent: Tuesday, January 10, 2017 6:45 AM
To: Ankireddypalle Reddy; Gluster Devel (gluster-de...@gluster.org);
gluster-users@gluster.org
Subject: Re: [Gluster-devel] Lot of EIO errors in disperse volume

Hi Ram,

can you execute the following command on all bricks on a file that is
giving EIO ?

getfattr -m. -e hex -d <file>

Xavi

On 10/01/17 12:41, Ankireddypalle Reddy wrote:

Xavi,
  

Re: [Gluster-users] [Gluster-devel] Lot of EIO errors in disperse volume

2017-01-10 Thread Ankireddypalle Reddy
Attachments (2): ec.txt (11.50 KB), ws-glus.log (3.48 MB)


Xavi,
  We are encountering errors for different kinds of FOPS.
  The open failed for the following file:

  cvd_2017_01_10_02_28_26.log:98182 1f9fe 01/10 00:57:10 8414465 
[MEDIAFS] 20117519-52075477 SingleInstancer_FS::StartDataFile2: Failed to 
create the data file 
[/ws/glus/Folder_07.11.2016_23.02/CV_MAGNETIC/V_8854974/CHUNK_51342720/SFILE_CONTAINER_062],
 error=0xECCC0005:{CQiFile::Open(92)} + 
{CQiUTFOSAPI::open(96)/ErrNo.5.(Input/output error)-Open failed, 
File=/ws/glus/Folder_07.11.2016_23.02/CV_MAGNETIC/V_8854974/CHUNK_51342720/SFILE_CONTAINER_062,
 OperationFlag=0xC1, PermissionMode=0x1FF}

  I've attached the extended attributes for the directories
  /ws/glus/Folder_07.11.2016_23.02/CV_MAGNETIC/V_8854974/ and
  /ws/glus/Folder_07.11.2016_23.02/CV_MAGNETIC/V_8854974/CHUNK_51342720 
from all the bricks.

 The attributes look fine to me. I've also attached some log cuts to 
illustrate the problem.

Thanks and Regards,
Ram

-Original Message-
From: Xavier Hernandez [mailto:xhernan...@datalab.es]
Sent: Tuesday, January 10, 2017 7:53 AM
To: Ankireddypalle Reddy; Gluster Devel (gluster-de...@gluster.org); 
gluster-users@gluster.org
Subject: Re: [Gluster-devel] Lot of EIO errors in disperse volume

Hi Ram,

the error is caused by an extended attribute that does not match on all
3 bricks of the disperse set. Most probable value is trusted.ec.version, but 
could be others.

At first sight, I don't see any change from 3.7.8 that could have caused this. 
I'll check again.

What kind of operations are you doing ? this can help me narrow the search.

Xavi

On 10/01/17 13:43, Ankireddypalle Reddy wrote:
> Xavi,
>   Thanks. If you could please explain what to look for in the 
> extended attributes then I will check and let you know if I find anything 
> suspicious.  Also we noticed that some of these operations would succeed if 
> retried. Do you know of any communicated related errors that are being 
> reported/triaged.
>
> Thanks and Regards,
> Ram
>
> -Original Message-
> From: Xavier Hernandez [mailto:xhernan...@datalab.es]
> Sent: Tuesday, January 10, 2017 7:23 AM
> To: Ankireddypalle Reddy; Gluster Devel (gluster-de...@gluster.org);
> gluster-users@gluster.org
> Subject: Re: [Gluster-devel] Lot of EIO errors in disperse volume
>
> Hi Ram,
>
> On 10/01/17 13:14, Ankireddypalle Reddy wrote:
>> Attachment (1): ecxattrs.txt (5.92 KB)
>>
>> Xavi,
>>  Please find attached the extended attributes for a
>> directory from all the bricks. Free space check failed for this with
>> error number EIO.
>
> What do you mean ? what operation have you made to check the free space on 
> that directory ?
>
> If it's a recursive check, I need the extended attributes from the exact file 
> that triggers the EIO. The attached attributes seem consistent and that 
> directory shouldn't cause any problem. Does an 'ls' on that directory fail or 
> does it show the contents ?
>
> Xavi
>
>>
>> Thanks and Regards,
>> Ram
>>
>> -Original Message-
>> From: Xavier Hernandez [mailto:xhernan...@datalab.es]
>> Sent: Tuesday, January 10, 2017 6:45 AM
>> To: Ankireddypalle Reddy; Gluster Devel (gluster-de...@gluster.org);
>> gluster-users@gluster.org
>> Subject: Re: [Gluster-devel] Lot of EIO errors in disperse volume
>>
>> Hi Ram,
>>
>> can you execute the following command on all bricks on a file that is
>> giving EIO ?
>>
>> getfattr -m. -e hex -d <file>
>>
>> Xavi
>>
>> On 10/01/17 12:41, 

Re: [Gluster-users] [Gluster-devel] Lot of EIO errors in disperse volume

2017-01-10 Thread Xavier Hernandez

Hi Ram,

the error is caused by an extended attribute that does not match on all 
3 bricks of the disperse set. Most probable value is trusted.ec.version, 
but could be others.


At first sight, I don't see any change from 3.7.8 that could have caused 
this. I'll check again.


What kind of operations are you doing ? this can help me narrow the search.

Xavi

On 10/01/17 13:43, Ankireddypalle Reddy wrote:

Xavi,
  Thanks. If you could please explain what to look for in the extended 
attributes then I will check and let you know if I find anything suspicious.  
Also we noticed that some of these operations would succeed if retried. Do you 
know of any communication-related errors that are being reported/triaged?

Thanks and Regards,
Ram

-Original Message-
From: Xavier Hernandez [mailto:xhernan...@datalab.es]
Sent: Tuesday, January 10, 2017 7:23 AM
To: Ankireddypalle Reddy; Gluster Devel (gluster-de...@gluster.org); 
gluster-users@gluster.org
Subject: Re: [Gluster-devel] Lot of EIO errors in disperse volume

Hi Ram,

On 10/01/17 13:14, Ankireddypalle Reddy wrote:

Attachment (1): ecxattrs.txt (5.92 KB)

Xavi,
 Please find attached the extended attributes for a
directory from all the bricks. Free space check failed for this with
error number EIO.


What do you mean ? what operation have you made to check the free space on that 
directory ?

If it's a recursive check, I need the extended attributes from the exact file 
that triggers the EIO. The attached attributes seem consistent and that 
directory shouldn't cause any problem. Does an 'ls' on that directory fail or 
does it show the contents ?

Xavi



Thanks and Regards,
Ram

-Original Message-
From: Xavier Hernandez [mailto:xhernan...@datalab.es]
Sent: Tuesday, January 10, 2017 6:45 AM
To: Ankireddypalle Reddy; Gluster Devel (gluster-de...@gluster.org);
gluster-users@gluster.org
Subject: Re: [Gluster-devel] Lot of EIO errors in disperse volume

Hi Ram,

can you execute the following command on all bricks on a file that is
giving EIO ?

getfattr -m. -e hex -d <file>

Xavi
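
Once the getfattr dumps are collected from every brick, spotting the odd one out can be scripted. The helper below is hypothetical (the brick names, values, and dict shape are made up for illustration); it flags any brick whose trusted.ec.* values disagree with the majority:

```python
from collections import Counter

def find_odd_bricks(dumps):
    """dumps: {brick_name: {xattr_name: hex_value}}, one dict per brick."""
    odd = {}
    keys = {k for d in dumps.values() for k in d if k.startswith("trusted.ec.")}
    for key in sorted(keys):
        values = {b: d.get(key) for b, d in dumps.items()}
        # The majority value is assumed good; dissenters are suspects.
        majority, _ = Counter(values.values()).most_common(1)[0]
        for brick, v in values.items():
            if v != majority:
                odd.setdefault(brick, []).append(key)
    return odd

# Example: brick "b3" disagrees on trusted.ec.size.
dumps = {"b1": {"trusted.ec.size": "0x00", "trusted.ec.version": "0x02"},
         "b2": {"trusted.ec.size": "0x00", "trusted.ec.version": "0x02"},
         "b3": {"trusted.ec.size": "0x390b00000000", "trusted.ec.version": "0x02"}}
# find_odd_bricks(dumps) -> {"b3": ["trusted.ec.size"]}
```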

On 10/01/17 12:41, Ankireddypalle Reddy wrote:

Xavi,
We have been running 3.7.8 on these servers. We upgraded

to 3.7.18 yesterday. We upgraded all the servers at the same time.  The
volume was brought down during upgrade.


Thanks and Regards,
Ram

-Original Message-
From: Xavier Hernandez [mailto:xhernan...@datalab.es]
Sent: Tuesday, January 10, 2017 6:35 AM
To: Ankireddypalle Reddy; Gluster Devel (gluster-de...@gluster.org);
gluster-users@gluster.org
Subject: Re: [Gluster-devel] Lot of EIO errors in disperse volume

Hi Ram,

how did you upgrade gluster ? from which version ?

Did you upgrade one server at a time and waited until self-heal

finished before upgrading the next server ?


Xavi

On 10/01/17 11:39, Ankireddypalle Reddy wrote:

Hi,

  We upgraded to GlusterFS 3.7.18 yesterday.  We see lot of
failures in our applications. Most of the errors are EIO. The
following log lines are commonly seen in the logs:



The message "W [MSGID: 122056] [ec-combine.c:873:ec_combine_check]
0-StoragePool-disperse-4: Mismatching xdata in answers of 'LOOKUP'"
repeated 2 times between [2017-01-10 02:46:25.069809] and
[2017-01-10 02:46:25.069835]

[2017-01-10 02:46:25.069852] W [MSGID: 122056]
[ec-combine.c:873:ec_combine_check] 0-StoragePool-disperse-5:
Mismatching xdata in answers of 'LOOKUP'

The message "W [MSGID: 122056] [ec-combine.c:873:ec_combine_check]
0-StoragePool-disperse-5: Mismatching xdata in answers of 'LOOKUP'"
repeated 2 times between [2017-01-10 02:46:25.069852] and
[2017-01-10 02:46:25.069873]

[2017-01-10 02:46:25.069910] W [MSGID: 122056]
[ec-combine.c:873:ec_combine_check] 0-StoragePool-disperse-6:
Mismatching xdata in answers of 'LOOKUP'

...

[2017-01-10 02:46:26.520774] I [MSGID: 109036]
[dht-common.c:9076:dht_log_new_layout_for_dir_selfheal]
0-StoragePool-dht: Setting layout of
/Folder_07.11.2016_23.02/CV_MAGNETIC/V_8854213/CHUNK_51334585 with
[Subvol_name: StoragePool-disperse-0, Err: -1 , Start: 3221225466 ,
Stop: 3758096376 , Hash: 1 ], [Subvol_name: StoragePool-disperse-1, Err:
-1 , Start: 3758096377 , Stop: 4294967295 , Hash: 1 ], [Subvol_name:
StoragePool-disperse-2, Err: -1 , Start: 0 , Stop: 536870910 , Hash:
1 ], [Subvol_name: StoragePool-disperse-3, Err: -1 , Start:
536870911 ,
Stop: 1073741821 , Hash: 1 ], [Subvol_name: StoragePool-disperse-4, Err:
-1 , Start: 1073741822 , Stop: 1610612732 , Hash: 1 ], [Subvol_name:
StoragePool-disperse-5, Err: -1 , Start: 

Re: [Gluster-users] [Gluster-devel] Lot of EIO errors in disperse volume

2017-01-10 Thread Ankireddypalle Reddy
Xavi,
  Thanks. If you could please explain what to look for in the extended 
attributes then I will check and let you know if I find anything suspicious.  
Also we noticed that some of these operations would succeed if retried. Do you 
know of any communication-related errors that are being reported/triaged?

Thanks and Regards,
Ram

-Original Message-
From: Xavier Hernandez [mailto:xhernan...@datalab.es] 
Sent: Tuesday, January 10, 2017 7:23 AM
To: Ankireddypalle Reddy; Gluster Devel (gluster-de...@gluster.org); 
gluster-users@gluster.org
Subject: Re: [Gluster-devel] Lot of EIO errors in disperse volume

Hi Ram,

On 10/01/17 13:14, Ankireddypalle Reddy wrote:
> Attachment (1): ecxattrs.txt (5.92 KB)
>
> Xavi,
>  Please find attached the extended attributes for a 
> directory from all the bricks. Free space check failed for this with 
> error number EIO.

What do you mean ? what operation have you made to check the free space on that 
directory ?

If it's a recursive check, I need the extended attributes from the exact file 
that triggers the EIO. The attached attributes seem consistent and that 
directory shouldn't cause any problem. Does an 'ls' on that directory fail or 
does it show the contents ?

Xavi

>
> Thanks and Regards,
> Ram
>
> -Original Message-
> From: Xavier Hernandez [mailto:xhernan...@datalab.es]
> Sent: Tuesday, January 10, 2017 6:45 AM
> To: Ankireddypalle Reddy; Gluster Devel (gluster-de...@gluster.org); 
> gluster-users@gluster.org
> Subject: Re: [Gluster-devel] Lot of EIO errors in disperse volume
>
> Hi Ram,
>
> Can you execute the following command on all bricks, on a file that is
> giving EIO?
>
> getfattr -m. -e hex -d 
>
> Xavi
>
> On 10/01/17 12:41, Ankireddypalle Reddy wrote:
>> Xavi,
>> We have been running 3.7.8 on these servers. We upgraded
>> to 3.7.18 yesterday. We upgraded all the servers at the same time. The
>> volume was brought down during the upgrade.
>>
>> Thanks and Regards,
>> Ram
>>
>> -Original Message-
>> From: Xavier Hernandez [mailto:xhernan...@datalab.es]
>> Sent: Tuesday, January 10, 2017 6:35 AM
>> To: Ankireddypalle Reddy; Gluster Devel (gluster-de...@gluster.org); 
>> gluster-users@gluster.org
>> Subject: Re: [Gluster-devel] Lot of EIO errors in disperse volume
>>
>> Hi Ram,
>>
>> How did you upgrade Gluster? From which version?
>>
>> Did you upgrade one server at a time and wait until self-heal
>> finished before upgrading the next server?
>>
>> Xavi
>>
>> On 10/01/17 11:39, Ankireddypalle Reddy wrote:
>>> Hi,
>>>
>>>   We upgraded to GlusterFS 3.7.18 yesterday. We see a lot of 
>>> failures in our applications. Most of the errors are EIO. The 
>>> following log lines are commonly seen in the logs:
>>>
>>>
>>>
>>> The message "W [MSGID: 122056] [ec-combine.c:873:ec_combine_check]
>>> 0-StoragePool-disperse-4: Mismatching xdata in answers of 'LOOKUP'"
>>> repeated 2 times between [2017-01-10 02:46:25.069809] and 
>>> [2017-01-10 02:46:25.069835]
>>>
>>> [2017-01-10 02:46:25.069852] W [MSGID: 122056] 
>>> [ec-combine.c:873:ec_combine_check] 0-StoragePool-disperse-5:
>>> Mismatching xdata in answers of 'LOOKUP'
>>>
>>> The message "W [MSGID: 122056] [ec-combine.c:873:ec_combine_check]
>>> 0-StoragePool-disperse-5: Mismatching xdata in answers of 'LOOKUP'"
>>> repeated 2 times between [2017-01-10 02:46:25.069852] and 
>>> [2017-01-10 02:46:25.069873]
>>>
>>> [2017-01-10 02:46:25.069910] W [MSGID: 122056] 
>>> [ec-combine.c:873:ec_combine_check] 0-StoragePool-disperse-6:
>>> Mismatching xdata in answers of 'LOOKUP'
>>>
>>> ...
>>>
>>> [2017-01-10 02:46:26.520774] I [MSGID: 109036] 
>>> [dht-common.c:9076:dht_log_new_layout_for_dir_selfheal]
>>> 0-StoragePool-dht: Setting layout of
>>> /Folder_07.11.2016_23.02/CV_MAGNETIC/V_8854213/CHUNK_51334585 with
>>> [Subvol_name: StoragePool-disperse-0, Err: -1 , Start: 3221225466 ,
>>> Stop: 3758096376 , Hash: 1 ], [Subvol_name: StoragePool-disperse-1, Err:
>>> -1 , Start: 3758096377 , Stop: 4294967295 , Hash: 1 ], [Subvol_name:
>>> StoragePool-disperse-2, Err: -1 , Start: 0 , Stop: 536870910 , Hash:
>>> 1 ], [Subvol_name: StoragePool-disperse-3, Err: -1 , Start: 
>>> 536870911 ,
>>> Stop: 1073741821 , Hash: 1 ], [Subvol_name: StoragePool-disperse-4, Err:
>>> -1 , Start: 1073741822 , Stop: 1610612732 , Hash: 1 ], [Subvol_name:
>>> StoragePool-disperse-5, Err: -1 , Start: 1610612733 , Stop:
>>> 2147483643 ,
>>> Hash: 1 ], [Subvol_name: StoragePool-disperse-6, Err: -1 , Start:
>>> 

Re: [Gluster-users] [Gluster-devel] Lot of EIO errors in disperse volume

2017-01-10 Thread Xavier Hernandez

Hi Ram,

On 10/01/17 13:14, Ankireddypalle Reddy wrote:

Attachment (1): ecxattrs.txt (5.92 KB)

Xavi,
 Please find attached the extended attributes for a
directory from all the bricks. A free space check on this directory
failed with error EIO.


What do you mean? What operation did you perform to check the free space 
on that directory?


If it's a recursive check, I need the extended attributes from the exact 
file that triggers the EIO. The attached attributes seem consistent and 
that directory shouldn't cause any problem. Does an 'ls' on that 
directory fail or does it show the contents?


Xavi



Thanks and Regards,
Ram

-Original Message-
From: Xavier Hernandez [mailto:xhernan...@datalab.es]
Sent: Tuesday, January 10, 2017 6:45 AM
To: Ankireddypalle Reddy; Gluster Devel (gluster-de...@gluster.org);
gluster-users@gluster.org
Subject: Re: [Gluster-devel] Lot of EIO errors in disperse volume

Hi Ram,

Can you execute the following command on all bricks, on a file that is
giving EIO?

getfattr -m. -e hex -d 

Xavi

On 10/01/17 12:41, Ankireddypalle Reddy wrote:

Xavi,
We have been running 3.7.8 on these servers. We upgraded

to 3.7.18 yesterday. We upgraded all the servers at the same time. The volume
was brought down during the upgrade.


Thanks and Regards,
Ram

-Original Message-
From: Xavier Hernandez [mailto:xhernan...@datalab.es]
Sent: Tuesday, January 10, 2017 6:35 AM
To: Ankireddypalle Reddy; Gluster Devel (gluster-de...@gluster.org);
gluster-users@gluster.org
Subject: Re: [Gluster-devel] Lot of EIO errors in disperse volume

Hi Ram,

How did you upgrade Gluster? From which version?

Did you upgrade one server at a time and wait until self-heal finished
before upgrading the next server?


Xavi

On 10/01/17 11:39, Ankireddypalle Reddy wrote:

Hi,

  We upgraded to GlusterFS 3.7.18 yesterday. We see a lot of
failures in our applications. Most of the errors are EIO. The
following log lines are commonly seen in the logs:



The message "W [MSGID: 122056] [ec-combine.c:873:ec_combine_check]
0-StoragePool-disperse-4: Mismatching xdata in answers of 'LOOKUP'"
repeated 2 times between [2017-01-10 02:46:25.069809] and [2017-01-10
02:46:25.069835]

[2017-01-10 02:46:25.069852] W [MSGID: 122056]
[ec-combine.c:873:ec_combine_check] 0-StoragePool-disperse-5:
Mismatching xdata in answers of 'LOOKUP'

The message "W [MSGID: 122056] [ec-combine.c:873:ec_combine_check]
0-StoragePool-disperse-5: Mismatching xdata in answers of 'LOOKUP'"
repeated 2 times between [2017-01-10 02:46:25.069852] and [2017-01-10
02:46:25.069873]

[2017-01-10 02:46:25.069910] W [MSGID: 122056]
[ec-combine.c:873:ec_combine_check] 0-StoragePool-disperse-6:
Mismatching xdata in answers of 'LOOKUP'

...

[2017-01-10 02:46:26.520774] I [MSGID: 109036]
[dht-common.c:9076:dht_log_new_layout_for_dir_selfheal]
0-StoragePool-dht: Setting layout of
/Folder_07.11.2016_23.02/CV_MAGNETIC/V_8854213/CHUNK_51334585 with
[Subvol_name: StoragePool-disperse-0, Err: -1 , Start: 3221225466 ,
Stop: 3758096376 , Hash: 1 ], [Subvol_name: StoragePool-disperse-1, Err:
-1 , Start: 3758096377 , Stop: 4294967295 , Hash: 1 ], [Subvol_name:
StoragePool-disperse-2, Err: -1 , Start: 0 , Stop: 536870910 , Hash:
1 ], [Subvol_name: StoragePool-disperse-3, Err: -1 , Start: 536870911
,
Stop: 1073741821 , Hash: 1 ], [Subvol_name: StoragePool-disperse-4, Err:
-1 , Start: 1073741822 , Stop: 1610612732 , Hash: 1 ], [Subvol_name:
StoragePool-disperse-5, Err: -1 , Start: 1610612733 , Stop:
2147483643 ,
Hash: 1 ], [Subvol_name: StoragePool-disperse-6, Err: -1 , Start:
2147483644 , Stop: 2684354554 , Hash: 1 ], [Subvol_name:
StoragePool-disperse-7, Err: -1 , Start: 2684354555 , Stop:
3221225465 ,
Hash: 1 ],

[2017-01-10 02:46:26.522841] N [MSGID: 122031]
[ec-generic.c:1130:ec_combine_xattrop] 0-StoragePool-disperse-3:
Mismatching dictionary in answers of 'GF_FOP_XATTROP'

The message "N [MSGID: 122031] [ec-generic.c:1130:ec_combine_xattrop]
0-StoragePool-disperse-3: Mismatching dictionary in answers of
'GF_FOP_XATTROP'" repeated 2 times between [2017-01-10
02:46:26.522841] and [2017-01-10 02:46:26.522894]

[2017-01-10 02:46:26.522898] W [MSGID: 122040]
[ec-common.c:919:ec_prepare_update_cbk] 0-StoragePool-disperse-3:
Failed to get size and version [Input/output error]

[2017-01-10 02:46:26.523115] N [MSGID: 122031]
[ec-generic.c:1130:ec_combine_xattrop] 0-StoragePool-disperse-6:
Mismatching dictionary in answers of 'GF_FOP_XATTROP'

The message "N [MSGID: 122031] 

Re: [Gluster-users] [Gluster-devel] Lot of EIO errors in disperse volume

2017-01-10 Thread Xavier Hernandez

Hi Ram,

Can you execute the following command on all bricks, on a file that is
giving EIO?


getfattr -m. -e hex -d 

Xavi
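
A way to act on those dumps: collect the getfattr output from each brick and diff the EC metadata mechanically. The sketch below is illustrative only; the attribute names (trusted.ec.size, trusted.ec.version) and the sample values are assumptions, not taken from this cluster, and the exact on-disk layout varies across gluster versions.

```python
# Hypothetical helper: compare the hex xattrs that `getfattr -m. -e hex -d`
# prints on each brick and flag the brick that disagrees with the majority.
from collections import Counter

def parse_getfattr(output):
    """Return {attr: hexvalue} from `getfattr -e hex` output."""
    attrs = {}
    for line in output.splitlines():
        line = line.strip()
        if line.startswith("trusted.") and "=" in line:
            name, value = line.split("=", 1)
            attrs[name] = value
    return attrs

def odd_bricks(per_brick, attr):
    """Bricks whose value for `attr` differs from the majority value."""
    values = {brick: attrs.get(attr) for brick, attrs in per_brick.items()}
    majority, _ = Counter(values.values()).most_common(1)[0]
    return sorted(b for b, v in values.items() if v != majority)

# Made-up sample outputs from the three bricks of one disperse set:
per_brick = {
    "glusterfs4": parse_getfattr("trusted.ec.size=0x0000000000000400\n"
                                 "trusted.ec.version=0x00000000000000010000000000000001"),
    "glusterfs5": parse_getfattr("trusted.ec.size=0x0000000000000400\n"
                                 "trusted.ec.version=0x00000000000000010000000000000001"),
    "glusterfs6": parse_getfattr("trusted.ec.size=0x0000000000000000\n"
                                 "trusted.ec.version=0x00000000000000000000000000000000"),
}

print(odd_bricks(per_brick, "trusted.ec.size"))     # prints ['glusterfs6']
print(odd_bricks(per_brick, "trusted.ec.version"))  # prints ['glusterfs6']
```

In this made-up sample, glusterfs6 would be the brick holding the inconsistent fragment.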

On 10/01/17 12:41, Ankireddypalle Reddy wrote:

Xavi,
We have been running 3.7.8 on these servers. We upgraded to 3.7.18 
yesterday. We upgraded all the servers at the same time. The volume was brought down 
during the upgrade.

Thanks and Regards,
Ram

-Original Message-
From: Xavier Hernandez [mailto:xhernan...@datalab.es]
Sent: Tuesday, January 10, 2017 6:35 AM
To: Ankireddypalle Reddy; Gluster Devel (gluster-de...@gluster.org); 
gluster-users@gluster.org
Subject: Re: [Gluster-devel] Lot of EIO errors in disperse volume

Hi Ram,

How did you upgrade Gluster? From which version?

Did you upgrade one server at a time and wait until self-heal finished before 
upgrading the next server?

Xavi

On 10/01/17 11:39, Ankireddypalle Reddy wrote:

Hi,

  We upgraded to GlusterFS 3.7.18 yesterday. We see a lot of
failures in our applications. Most of the errors are EIO. The
following log lines are commonly seen in the logs:



The message "W [MSGID: 122056] [ec-combine.c:873:ec_combine_check]
0-StoragePool-disperse-4: Mismatching xdata in answers of 'LOOKUP'"
repeated 2 times between [2017-01-10 02:46:25.069809] and [2017-01-10
02:46:25.069835]

[2017-01-10 02:46:25.069852] W [MSGID: 122056]
[ec-combine.c:873:ec_combine_check] 0-StoragePool-disperse-5:
Mismatching xdata in answers of 'LOOKUP'

The message "W [MSGID: 122056] [ec-combine.c:873:ec_combine_check]
0-StoragePool-disperse-5: Mismatching xdata in answers of 'LOOKUP'"
repeated 2 times between [2017-01-10 02:46:25.069852] and [2017-01-10
02:46:25.069873]

[2017-01-10 02:46:25.069910] W [MSGID: 122056]
[ec-combine.c:873:ec_combine_check] 0-StoragePool-disperse-6:
Mismatching xdata in answers of 'LOOKUP'

...

[2017-01-10 02:46:26.520774] I [MSGID: 109036]
[dht-common.c:9076:dht_log_new_layout_for_dir_selfheal]
0-StoragePool-dht: Setting layout of
/Folder_07.11.2016_23.02/CV_MAGNETIC/V_8854213/CHUNK_51334585 with
[Subvol_name: StoragePool-disperse-0, Err: -1 , Start: 3221225466 ,
Stop: 3758096376 , Hash: 1 ], [Subvol_name: StoragePool-disperse-1, Err:
-1 , Start: 3758096377 , Stop: 4294967295 , Hash: 1 ], [Subvol_name:
StoragePool-disperse-2, Err: -1 , Start: 0 , Stop: 536870910 , Hash: 1
], [Subvol_name: StoragePool-disperse-3, Err: -1 , Start: 536870911 ,
Stop: 1073741821 , Hash: 1 ], [Subvol_name: StoragePool-disperse-4, Err:
-1 , Start: 1073741822 , Stop: 1610612732 , Hash: 1 ], [Subvol_name:
StoragePool-disperse-5, Err: -1 , Start: 1610612733 , Stop: 2147483643
,
Hash: 1 ], [Subvol_name: StoragePool-disperse-6, Err: -1 , Start:
2147483644 , Stop: 2684354554 , Hash: 1 ], [Subvol_name:
StoragePool-disperse-7, Err: -1 , Start: 2684354555 , Stop: 3221225465
,
Hash: 1 ],

[2017-01-10 02:46:26.522841] N [MSGID: 122031]
[ec-generic.c:1130:ec_combine_xattrop] 0-StoragePool-disperse-3:
Mismatching dictionary in answers of 'GF_FOP_XATTROP'

The message "N [MSGID: 122031] [ec-generic.c:1130:ec_combine_xattrop]
0-StoragePool-disperse-3: Mismatching dictionary in answers of
'GF_FOP_XATTROP'" repeated 2 times between [2017-01-10
02:46:26.522841] and [2017-01-10 02:46:26.522894]

[2017-01-10 02:46:26.522898] W [MSGID: 122040]
[ec-common.c:919:ec_prepare_update_cbk] 0-StoragePool-disperse-3:
Failed to get size and version [Input/output error]

[2017-01-10 02:46:26.523115] N [MSGID: 122031]
[ec-generic.c:1130:ec_combine_xattrop] 0-StoragePool-disperse-6:
Mismatching dictionary in answers of 'GF_FOP_XATTROP'

The message "N [MSGID: 122031] [ec-generic.c:1130:ec_combine_xattrop]
0-StoragePool-disperse-6: Mismatching dictionary in answers of
'GF_FOP_XATTROP'" repeated 2 times between [2017-01-10
02:46:26.523115] and [2017-01-10 02:46:26.523143]

[2017-01-10 02:46:26.523147] W [MSGID: 122040]
[ec-common.c:919:ec_prepare_update_cbk] 0-StoragePool-disperse-6:
Failed to get size and version [Input/output error]

[2017-01-10 02:46:26.523302] N [MSGID: 122031]
[ec-generic.c:1130:ec_combine_xattrop] 0-StoragePool-disperse-2:
Mismatching dictionary in answers of 'GF_FOP_XATTROP'

The message "N [MSGID: 122031] [ec-generic.c:1130:ec_combine_xattrop]
0-StoragePool-disperse-2: Mismatching dictionary in answers of
'GF_FOP_XATTROP'" repeated 2 times between [2017-01-10
02:46:26.523302] and [2017-01-10 02:46:26.523324]

[2017-01-10 02:46:26.523328] W [MSGID: 122040]
[ec-common.c:919:ec_prepare_update_cbk] 0-StoragePool-disperse-2:
Failed to get size and version [Input/output error]



[root@glusterfs3 Log_Files]# gluster --version

glusterfs 3.7.18 built on Dec  8 2016 06:34:26



[root@glusterfs3 Log_Files]# gluster volume info



Volume Name: StoragePool

Type: Distributed-Disperse

Volume ID: 149e976f-4e21-451c-bf0f-f5691208531f

Status: Started

Number of Bricks: 8 x (2 + 1) = 24

Transport-type: tcp

Bricks:

Brick1: glusterfs1sds:/ws/disk1/ws_brick

Brick2: glusterfs2sds:/ws/disk1/ws_brick

Brick3: glusterfs3sds:/ws/disk1/ws_brick

Brick4: 

Re: [Gluster-users] [Gluster-devel] Lot of EIO errors in disperse volume

2017-01-10 Thread Xavier Hernandez

Hi Ram,

How did you upgrade Gluster? From which version?

Did you upgrade one server at a time and wait until self-heal finished 
before upgrading the next server?


Xavi
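
The rolling-upgrade procedure implied by these questions needs a wait-for-heal step between servers. A minimal sketch of that check, assuming the output format of `gluster volume heal <volname> info` with its per-brick "Number of entries:" lines (the sample text below is made up):

```python
# Hypothetical gate for a rolling upgrade: after upgrading one server, parse
# the heal-info output and only continue once no entries remain to be healed.
def pending_heal_entries(heal_info_output):
    """Sum the 'Number of entries:' counters across all bricks."""
    total = 0
    for line in heal_info_output.splitlines():
        if line.startswith("Number of entries:"):
            total += int(line.split(":", 1)[1])
    return total

# Made-up sample of `gluster volume heal StoragePool info` output:
sample = """\
Brick glusterfs1sds:/ws/disk1/ws_brick
Number of entries: 0

Brick glusterfs2sds:/ws/disk1/ws_brick
Number of entries: 2
"""

pending = pending_heal_entries(sample)
if pending == 0:
    print("heal complete: safe to upgrade the next server")
else:
    print("heal pending: wait before upgrading the next server")
```

In practice the input would come from running the heal-info command in a loop until the total reaches zero.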

On 10/01/17 11:39, Ankireddypalle Reddy wrote:

Hi,

  We upgraded to GlusterFS 3.7.18 yesterday. We see a lot of failures
in our applications. Most of the errors are EIO. The following log lines
are commonly seen in the logs:



The message "W [MSGID: 122056] [ec-combine.c:873:ec_combine_check]
0-StoragePool-disperse-4: Mismatching xdata in answers of 'LOOKUP'"
repeated 2 times between [2017-01-10 02:46:25.069809] and [2017-01-10
02:46:25.069835]

[2017-01-10 02:46:25.069852] W [MSGID: 122056]
[ec-combine.c:873:ec_combine_check] 0-StoragePool-disperse-5:
Mismatching xdata in answers of 'LOOKUP'

The message "W [MSGID: 122056] [ec-combine.c:873:ec_combine_check]
0-StoragePool-disperse-5: Mismatching xdata in answers of 'LOOKUP'"
repeated 2 times between [2017-01-10 02:46:25.069852] and [2017-01-10
02:46:25.069873]

[2017-01-10 02:46:25.069910] W [MSGID: 122056]
[ec-combine.c:873:ec_combine_check] 0-StoragePool-disperse-6:
Mismatching xdata in answers of 'LOOKUP'

…

[2017-01-10 02:46:26.520774] I [MSGID: 109036]
[dht-common.c:9076:dht_log_new_layout_for_dir_selfheal]
0-StoragePool-dht: Setting layout of
/Folder_07.11.2016_23.02/CV_MAGNETIC/V_8854213/CHUNK_51334585 with
[Subvol_name: StoragePool-disperse-0, Err: -1 , Start: 3221225466 ,
Stop: 3758096376 , Hash: 1 ], [Subvol_name: StoragePool-disperse-1, Err:
-1 , Start: 3758096377 , Stop: 4294967295 , Hash: 1 ], [Subvol_name:
StoragePool-disperse-2, Err: -1 , Start: 0 , Stop: 536870910 , Hash: 1
], [Subvol_name: StoragePool-disperse-3, Err: -1 , Start: 536870911 ,
Stop: 1073741821 , Hash: 1 ], [Subvol_name: StoragePool-disperse-4, Err:
-1 , Start: 1073741822 , Stop: 1610612732 , Hash: 1 ], [Subvol_name:
StoragePool-disperse-5, Err: -1 , Start: 1610612733 , Stop: 2147483643 ,
Hash: 1 ], [Subvol_name: StoragePool-disperse-6, Err: -1 , Start:
2147483644 , Stop: 2684354554 , Hash: 1 ], [Subvol_name:
StoragePool-disperse-7, Err: -1 , Start: 2684354555 , Stop: 3221225465 ,
Hash: 1 ],

[2017-01-10 02:46:26.522841] N [MSGID: 122031]
[ec-generic.c:1130:ec_combine_xattrop] 0-StoragePool-disperse-3:
Mismatching dictionary in answers of 'GF_FOP_XATTROP'

The message "N [MSGID: 122031] [ec-generic.c:1130:ec_combine_xattrop]
0-StoragePool-disperse-3: Mismatching dictionary in answers of
'GF_FOP_XATTROP'" repeated 2 times between [2017-01-10 02:46:26.522841]
and [2017-01-10 02:46:26.522894]

[2017-01-10 02:46:26.522898] W [MSGID: 122040]
[ec-common.c:919:ec_prepare_update_cbk] 0-StoragePool-disperse-3: Failed
to get size and version [Input/output error]

[2017-01-10 02:46:26.523115] N [MSGID: 122031]
[ec-generic.c:1130:ec_combine_xattrop] 0-StoragePool-disperse-6:
Mismatching dictionary in answers of 'GF_FOP_XATTROP'

The message "N [MSGID: 122031] [ec-generic.c:1130:ec_combine_xattrop]
0-StoragePool-disperse-6: Mismatching dictionary in answers of
'GF_FOP_XATTROP'" repeated 2 times between [2017-01-10 02:46:26.523115]
and [2017-01-10 02:46:26.523143]

[2017-01-10 02:46:26.523147] W [MSGID: 122040]
[ec-common.c:919:ec_prepare_update_cbk] 0-StoragePool-disperse-6: Failed
to get size and version [Input/output error]

[2017-01-10 02:46:26.523302] N [MSGID: 122031]
[ec-generic.c:1130:ec_combine_xattrop] 0-StoragePool-disperse-2:
Mismatching dictionary in answers of 'GF_FOP_XATTROP'

The message "N [MSGID: 122031] [ec-generic.c:1130:ec_combine_xattrop]
0-StoragePool-disperse-2: Mismatching dictionary in answers of
'GF_FOP_XATTROP'" repeated 2 times between [2017-01-10 02:46:26.523302]
and [2017-01-10 02:46:26.523324]

[2017-01-10 02:46:26.523328] W [MSGID: 122040]
[ec-common.c:919:ec_prepare_update_cbk] 0-StoragePool-disperse-2: Failed
to get size and version [Input/output error]



[root@glusterfs3 Log_Files]# gluster --version

glusterfs 3.7.18 built on Dec  8 2016 06:34:26



[root@glusterfs3 Log_Files]# gluster volume info



Volume Name: StoragePool

Type: Distributed-Disperse

Volume ID: 149e976f-4e21-451c-bf0f-f5691208531f

Status: Started

Number of Bricks: 8 x (2 + 1) = 24

Transport-type: tcp

Bricks:

Brick1: glusterfs1sds:/ws/disk1/ws_brick

Brick2: glusterfs2sds:/ws/disk1/ws_brick

Brick3: glusterfs3sds:/ws/disk1/ws_brick

Brick4: glusterfs1sds:/ws/disk2/ws_brick

Brick5: glusterfs2sds:/ws/disk2/ws_brick

Brick6: glusterfs3sds:/ws/disk2/ws_brick

Brick7: glusterfs1sds:/ws/disk3/ws_brick

Brick8: glusterfs2sds:/ws/disk3/ws_brick

Brick9: glusterfs3sds:/ws/disk3/ws_brick

Brick10: glusterfs1sds:/ws/disk4/ws_brick

Brick11: glusterfs2sds:/ws/disk4/ws_brick

Brick12: glusterfs3sds:/ws/disk4/ws_brick

Brick13: glusterfs1sds:/ws/disk5/ws_brick

Brick14: glusterfs2sds:/ws/disk5/ws_brick

Brick15: glusterfs3sds:/ws/disk5/ws_brick

Brick16: glusterfs1sds:/ws/disk6/ws_brick

Brick17: glusterfs2sds:/ws/disk6/ws_brick

Brick18: glusterfs3sds:/ws/disk6/ws_brick

Brick19: 
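
The "Setting layout of ..." line in the quoted logs assigns each of the eight disperse subvolumes one range of the 32-bit DHT hash space. Copying those Start/Stop values into a quick script confirms the new layout is well-formed: contiguous ranges with no gaps or overlaps.

```python
# Sanity-check the DHT layout ranges from the quoted log line.
ranges = {  # subvol: (start, stop), values copied from the log
    "StoragePool-disperse-0": (3221225466, 3758096376),
    "StoragePool-disperse-1": (3758096377, 4294967295),
    "StoragePool-disperse-2": (0, 536870910),
    "StoragePool-disperse-3": (536870911, 1073741821),
    "StoragePool-disperse-4": (1073741822, 1610612732),
    "StoragePool-disperse-5": (1610612733, 2147483643),
    "StoragePool-disperse-6": (2147483644, 2684354554),
    "StoragePool-disperse-7": (2684354555, 3221225465),
}

spans = sorted(ranges.values())
# The ranges must start at 0, end at 2^32 - 1, and be back-to-back.
assert spans[0][0] == 0 and spans[-1][1] == 2**32 - 1
assert all(prev_stop + 1 == start for (_, prev_stop), (start, _)
           in zip(spans, spans[1:]))
print("layout tiles the full 32-bit hash space with no gaps or overlaps")
```

So the layout itself is consistent; the EIO errors come from the EC size/version mismatch, not from a broken DHT layout.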

Re: [Gluster-users] [Gluster-devel] Lot of EIO errors in disperse volume

2017-01-10 Thread Ankireddypalle Reddy
Xavi,
We have been running 3.7.8 on these servers. We upgraded to 3.7.18 
yesterday. We upgraded all the servers at the same time. The volume was brought down 
during the upgrade.

Thanks and Regards,
Ram

-Original Message-
From: Xavier Hernandez [mailto:xhernan...@datalab.es] 
Sent: Tuesday, January 10, 2017 6:35 AM
To: Ankireddypalle Reddy; Gluster Devel (gluster-de...@gluster.org); 
gluster-users@gluster.org
Subject: Re: [Gluster-devel] Lot of EIO errors in disperse volume

Hi Ram,

How did you upgrade Gluster? From which version?

Did you upgrade one server at a time and wait until self-heal finished before 
upgrading the next server?

Xavi

On 10/01/17 11:39, Ankireddypalle Reddy wrote:
> Hi,
>
>   We upgraded to GlusterFS 3.7.18 yesterday. We see a lot of 
> failures in our applications. Most of the errors are EIO. The 
> following log lines are commonly seen in the logs:
>
>
>
> The message "W [MSGID: 122056] [ec-combine.c:873:ec_combine_check]
> 0-StoragePool-disperse-4: Mismatching xdata in answers of 'LOOKUP'"
> repeated 2 times between [2017-01-10 02:46:25.069809] and [2017-01-10 
> 02:46:25.069835]
>
> [2017-01-10 02:46:25.069852] W [MSGID: 122056] 
> [ec-combine.c:873:ec_combine_check] 0-StoragePool-disperse-5:
> Mismatching xdata in answers of 'LOOKUP'
>
> The message "W [MSGID: 122056] [ec-combine.c:873:ec_combine_check]
> 0-StoragePool-disperse-5: Mismatching xdata in answers of 'LOOKUP'"
> repeated 2 times between [2017-01-10 02:46:25.069852] and [2017-01-10 
> 02:46:25.069873]
>
> [2017-01-10 02:46:25.069910] W [MSGID: 122056] 
> [ec-combine.c:873:ec_combine_check] 0-StoragePool-disperse-6:
> Mismatching xdata in answers of 'LOOKUP'
>
> ...
>
> [2017-01-10 02:46:26.520774] I [MSGID: 109036] 
> [dht-common.c:9076:dht_log_new_layout_for_dir_selfheal]
> 0-StoragePool-dht: Setting layout of
> /Folder_07.11.2016_23.02/CV_MAGNETIC/V_8854213/CHUNK_51334585 with
> [Subvol_name: StoragePool-disperse-0, Err: -1 , Start: 3221225466 ,
> Stop: 3758096376 , Hash: 1 ], [Subvol_name: StoragePool-disperse-1, Err:
> -1 , Start: 3758096377 , Stop: 4294967295 , Hash: 1 ], [Subvol_name:
> StoragePool-disperse-2, Err: -1 , Start: 0 , Stop: 536870910 , Hash: 1 
> ], [Subvol_name: StoragePool-disperse-3, Err: -1 , Start: 536870911 ,
> Stop: 1073741821 , Hash: 1 ], [Subvol_name: StoragePool-disperse-4, Err:
> -1 , Start: 1073741822 , Stop: 1610612732 , Hash: 1 ], [Subvol_name:
> StoragePool-disperse-5, Err: -1 , Start: 1610612733 , Stop: 2147483643 
> ,
> Hash: 1 ], [Subvol_name: StoragePool-disperse-6, Err: -1 , Start:
> 2147483644 , Stop: 2684354554 , Hash: 1 ], [Subvol_name:
> StoragePool-disperse-7, Err: -1 , Start: 2684354555 , Stop: 3221225465 
> ,
> Hash: 1 ],
>
> [2017-01-10 02:46:26.522841] N [MSGID: 122031] 
> [ec-generic.c:1130:ec_combine_xattrop] 0-StoragePool-disperse-3:
> Mismatching dictionary in answers of 'GF_FOP_XATTROP'
>
> The message "N [MSGID: 122031] [ec-generic.c:1130:ec_combine_xattrop]
> 0-StoragePool-disperse-3: Mismatching dictionary in answers of 
> 'GF_FOP_XATTROP'" repeated 2 times between [2017-01-10 
> 02:46:26.522841] and [2017-01-10 02:46:26.522894]
>
> [2017-01-10 02:46:26.522898] W [MSGID: 122040] 
> [ec-common.c:919:ec_prepare_update_cbk] 0-StoragePool-disperse-3: 
> Failed to get size and version [Input/output error]
>
> [2017-01-10 02:46:26.523115] N [MSGID: 122031] 
> [ec-generic.c:1130:ec_combine_xattrop] 0-StoragePool-disperse-6:
> Mismatching dictionary in answers of 'GF_FOP_XATTROP'
>
> The message "N [MSGID: 122031] [ec-generic.c:1130:ec_combine_xattrop]
> 0-StoragePool-disperse-6: Mismatching dictionary in answers of 
> 'GF_FOP_XATTROP'" repeated 2 times between [2017-01-10 
> 02:46:26.523115] and [2017-01-10 02:46:26.523143]
>
> [2017-01-10 02:46:26.523147] W [MSGID: 122040] 
> [ec-common.c:919:ec_prepare_update_cbk] 0-StoragePool-disperse-6: 
> Failed to get size and version [Input/output error]
>
> [2017-01-10 02:46:26.523302] N [MSGID: 122031] 
> [ec-generic.c:1130:ec_combine_xattrop] 0-StoragePool-disperse-2:
> Mismatching dictionary in answers of 'GF_FOP_XATTROP'
>
> The message "N [MSGID: 122031] [ec-generic.c:1130:ec_combine_xattrop]
> 0-StoragePool-disperse-2: Mismatching dictionary in answers of 
> 'GF_FOP_XATTROP'" repeated 2 times between [2017-01-10 
> 02:46:26.523302] and [2017-01-10 02:46:26.523324]
>
> [2017-01-10 02:46:26.523328] W [MSGID: 122040] 
> [ec-common.c:919:ec_prepare_update_cbk] 0-StoragePool-disperse-2: 
> Failed to get size and version [Input/output error]
>
>
>
> [root@glusterfs3 Log_Files]# gluster --version
>
> glusterfs 3.7.18 built on Dec  8 2016 06:34:26
>
>
>
> [root@glusterfs3 Log_Files]# gluster volume info
>
>
>
> Volume Name: StoragePool
>
> Type: Distributed-Disperse
>
> Volume ID: 149e976f-4e21-451c-bf0f-f5691208531f
>
> Status: Started
>
> Number of Bricks: 8 x (2 + 1) = 24
>
> Transport-type: tcp
>
> Bricks:
>
> Brick1: glusterfs1sds:/ws/disk1/ws_brick
>
> Brick2: glusterfs2sds:/ws/disk1/ws_brick
>
>