Re: [Gluster-users] CentOS Storage SIG Repo for 3.8 is behind 3 versions (3.8.5). Is it still being used and can I help?

2017-01-13 Thread Daryl lee
Thanks everyone for the information. I'm happy to provide test repo
feedback on an ongoing basis to help things along, which I'll do right now.

Niels,
If you require more or less information please let me know, happy to help.  
Thanks for doing the builds!

I deployed GlusterFS v3.8.8 successfully to 5 servers running CentOS Linux 
release 7.3.1611 (Core); here are the results.

2 GlusterFS clients deployed the following packages:
---
glusterfs                 x86_64   3.8.8-1.el7   centos-gluster38-test   509 k
glusterfs-api             x86_64   3.8.8-1.el7   centos-gluster38-test    89 k
glusterfs-client-xlators  x86_64   3.8.8-1.el7   centos-gluster38-test   781 k
glusterfs-fuse            x86_64   3.8.8-1.el7   centos-gluster38-test   133 k
glusterfs-libs            x86_64   3.8.8-1.el7   centos-gluster38-test   378 k

Tests:
---
Package DOWNLOAD/UPDATE/CLEANUP from repo  - SUCCESS
Basic FUSE mount RW test to remote GlusterFS volume - SUCCESS 
Boot and basic functionality test of libvirt gfapi based KVM Virtual Machine - 
SUCCESS
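For reference, a basic FUSE read/write smoke test like the one reported above can be scripted roughly as follows. This is only a sketch; the server name "gfs1" and volume name "testvol" are placeholders, not the actual test environment.

```shell
#!/usr/bin/env bash
# Rough sketch of a FUSE mount read/write smoke test.
# "gfs1" and "testvol" are placeholder server/volume names.

fuse_rw_test() {
  local server=$1 volume=$2 mnt sum1 sum2
  mnt=$(mktemp -d) || return 1
  mount -t glusterfs "${server}:/${volume}" "$mnt" || return 1

  # Write a small random file with an fsync, then read it back twice
  # and compare checksums.
  dd if=/dev/urandom of="$mnt/rwtest.bin" bs=1M count=8 conv=fsync
  sum1=$(md5sum "$mnt/rwtest.bin" | awk '{print $1}')
  sum2=$(md5sum "$mnt/rwtest.bin" | awk '{print $1}')

  rm -f "$mnt/rwtest.bin"
  umount "$mnt" && rmdir "$mnt"
  [ -n "$sum1" ] && [ "$sum1" = "$sum2" ] && echo "RW test: SUCCESS"
}

# Usage (as root on a client host): fuse_rw_test gfs1 testvol
```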


3 GlusterFS Brick/Volume servers running REPLICA 3 ARBITER 1 updated the 
following packages:
---
glusterfs                 x86_64   3.8.8-1.el7    centos-gluster38-test   509 k
glusterfs-api             x86_64   3.8.8-1.el7    centos-gluster38-test    89 k
glusterfs-cli             x86_64   3.8.8-1.el7    centos-gluster38-test   182 k
glusterfs-client-xlators  x86_64   3.8.8-1.el7    centos-gluster38-test   781 k
glusterfs-fuse            x86_64   3.8.8-1.el7    centos-gluster38-test   133 k
glusterfs-libs            x86_64   3.8.8-1.el7    centos-gluster38-test   378 k
glusterfs-server          x86_64   3.8.8-1.el7    centos-gluster38-test   1.4 M
userspace-rcu             x86_64   0.7.16-3.el7   centos-gluster38-test    72 k

Tests:
---
Package DOWNLOAD/UPDATE/CLEANUP from repo - SUCCESS w/ warnings
*  warning while updating glusterfs-server-3.8.8-1.el7.x86_64: existing
gluster .vol files were backed up as .rpmsave. This is expected.
Bricks on all 3 servers started - SUCCESS
Self Healing Daemon on all 3 servers started - SUCCESS
Bitrot Daemon on all 3 servers started - SUCCESS
Scrubber Daemon on all 3 servers started - SUCCESS
First replica self healing - SUCCESS
Second replica self healing - SUCCESS
Arbiter replica self healing - SUCCESS
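The brick/daemon checks above map onto standard gluster CLI commands; a minimal sketch (the volume name is a placeholder):

```shell
#!/usr/bin/env bash
# Sketch of verifying bricks and service daemons after an update.
# "testvol" is a placeholder volume name.

check_volume_daemons() {
  local vol=$1
  # Shows brick processes plus the self-heal, bitrot and scrubber
  # daemons, with their online status and PIDs.
  gluster volume status "$vol"
  # Lists any entries still pending heal on each replica.
  gluster volume heal "$vol" info
}

# Usage (on a server node): check_volume_daemons testvol
```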


-Daryl


-Original Message-
From: Kaushal M [mailto:kshlms...@gmail.com] 
Sent: Friday, January 13, 2017 5:03 AM
To: Daryl lee
Cc: Pavel Szalbot; gluster-users; Niels de Vos
Subject: Re: [Gluster-users] CentOS Storage SIG Repo for 3.8 is behind 3 
versions (3.8.5). Is it still being used and can I help?

Packages for 3.7, 3.8 and 3.9 are being built for the Storage SIG.
Niels is very punctual about building them. The packages first land in the 
respective testing repositories. If someone verifies that the packages are 
okay, and gives Niels a heads-up, he pushes the packages to be signed and added 
to the release repositories.

The only issue is that Niels doesn't get enough (or any) verifications. And the 
packages linger in testing.
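For anyone wanting to help with those verifications, the workflow Kaushal describes might be sketched as below. Package and repo names match this thread; the final reporting step is a reminder, not a command.

```shell
#!/usr/bin/env bash
# Sketch of verifying packages from the CentOS Storage SIG testing repo
# on a disposable CentOS 7 test host. After testing, report the result
# back to the packager so the packages can be promoted to release.

verify_test_repo() {
  local repo=${1:-centos-gluster38-test}
  # Pull the candidate packages from the testing repository only.
  yum --enablerepo="$repo" -y update 'glusterfs*' || return 1
  # Confirm what actually got installed.
  rpm -qa 'glusterfs*' | sort
  # Basic sanity check: the binary should run and report the new version.
  glusterfs --version | head -n1
}

# Usage (as root): verify_test_repo centos-gluster38-test
```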

On Fri, Jan 13, 2017 at 3:31 PM, Pavel Szalbot  wrote:
> Hi, you can install 3.8.7 from centos-gluster38-test using:
>
> yum --enablerepo=centos-gluster38-test install glusterfs
>
> I am not sure how QA works for CentOS Storage SIG, but 3.8.7 works 
> same as
> 3.8.5 for me - libvirt gfapi is unfortunately broken, no other 
> problems detected.
>
> Btw 3.9 is short term maintenance release 
> 

Re: [Gluster-users] [ovirt-users] vdsm IOProcessClient WARNING Timeout waiting for communication thread for client

2017-01-13 Thread Nir Soffer
On Fri, Jan 13, 2017 at 7:41 PM, Bill James  wrote:
> resending without logs, except vdsm.log since list limit is too small.
>
>
>
> On 1/13/17 8:50 AM, Bill James wrote:
>
> We have an ovirt system with 3 clusters, all running centos7.
> ovirt engine is running on separate host,
> ovirt-engine-3.6.4.1-1.el7.centos.noarch
> 2 of the clusters are running newer version of ovirt, 3 nodes each,
> ovirt-engine-4.0.3-1.el7.centos.noarch, glusterfs-3.7.16-1.el7.x86_64,
> vdsm-4.18.11-1.el7.centos.x86_64.
> 1 cluster is still running the older version,
> ovirt-engine-3.6.4.1-1.el7.centos.noarch.

Which ioprocess version?

>
> Yes we are in the process of upgrading the whole system to ovirt4.0, but
> takes time
>
> One of the 2 clusters running ovirt4 is complaining of timeouts, vdsm
> talking to gluster. No warnings on the 2 other clusters.
>
>
>
> Thread-720062::DEBUG::2017-01-13
> 07:29:46,814::outOfProcess::87::Storage.oop::(getProcessPool) Creating
> ioprocess /rhev/data-center/mnt/glusterSD/ovirt1-gl.dmz.p
> rod.j2noc.com:_gv1
> Thread-720062::INFO::2017-01-13
> 07:29:46,814::__init__::325::IOProcessClient::(__init__) Starting client
> ioprocess-5874
> Thread-720062::DEBUG::2017-01-13
> 07:29:46,814::__init__::334::IOProcessClient::(_run) Starting ioprocess for
> client ioprocess-5874
> Thread-720062::DEBUG::2017-01-13
> 07:29:46,832::__init__::386::IOProcessClient::(_startCommunication) Starting
> communication thread for client ioprocess-5874
> Thread-720062::WARNING::2017-01-13
> 07:29:46,847::__init__::401::IOProcessClient::(_startCommunication) Timeout
> waiting for communication thread for client ioprocess-5874

This warning is harmless; it means that the ioprocess thread did not start
within 1 second.

This probably indicates that the host is overloaded; typically new threads
start instantly.

Anyway, I think we are using too short a timeout. Can you open an ioprocess
bug for this?

>
>
> [2017-01-12 07:27:58.685680] I [MSGID: 106488]
> [glusterd-handler.c:1533:__glusterd_handle_cli_get_volume] 0-glusterd:
> Received get vol req
> The message "I [MSGID: 106488]
> [glusterd-handler.c:1533:__glusterd_handle_cli_get_volume] 0-glusterd:
> Received get vol req" repeated 31 times between [2017-01-12 07:27:58.685680]
> and [2017-01-12 07:29:46.971939]
>
>
> attached logs: engine.log supervdsm.log vdsm.log
> etc-glusterfs-glusterd.vol.log cli.log
>
>
>
> ___
> Users mailing list
> us...@ovirt.org
> http://lists.ovirt.org/mailman/listinfo/users
>
___
Gluster-users mailing list
Gluster-users@gluster.org
http://lists.gluster.org/mailman/listinfo/gluster-users


Re: [Gluster-users] NFS service dying

2017-01-13 Thread Giuseppe Ragusa
On Fri, Jan 13, 2017, at 12:39, Niels de Vos wrote:
> On Wed, Jan 11, 2017 at 11:58:29AM -0700, Paul Allen wrote:
> > I'm running into an issue where the gluster nfs service keeps dying on a
> > new cluster I have setup recently. We've been using Gluster on several
> > other clusters now for about a year or so and I have never seen this
> > issue before, nor have I been able to find anything remotely similar to
> > it while searching on-line. I initially was using the latest version in
> > the Gluster Debian repository for Jessie, 3.9.0-1, and then I tried
> > using the next one down, 3.8.7-1. Both behave the same for me.
> > 
> > What I was seeing was after a while the nfs service on the NAS server
> > would suddenly die after a number of processes had run on the app server
> > I had connected to the new NAS servers for testing (we're upgrading the
> > NAS servers for this cluster to newer hardware and expanded storage, the
> > current production NAS servers are using nfs-kernel-server with no type
> > of clustering of the data). I checked the logs but all it showed me was
> > something that looked like a stack trace in the nfs.log and the
> > glustershd.log showed the nfs service disconnecting. I turned on
> > debugging but it didn't give me a whole lot more, and certainly nothing
> > that helps me identify the source of my issue. It is pretty consistent
> > in dying shortly after I mount the file system on the servers and start
> > testing, usually within 15-30 minutes. But if I have nothing using the
> > file system, mounted or no, the service stays running for days. I tried
> mounting it using the gluster client, and it works fine, but I can't use 
> > that due to the performance penalty, it slows the websites down by a few
> > seconds at a minimum.
> 
> This seems to be related to the NLM protocol that Gluster/NFS provides.
> Earlier this week one of our Red Hat quality engineers also reported
> this (or a very similar) problem.
> 
> https://bugzilla.redhat.com/show_bug.cgi?id=1411344
> 
> At the moment I suspect that this is related to re-connects of some
> kind, but I have not been able to identify the cause sufficiently to be
> sure. This definitely is a coding problem in Gluster/NFS, but the more I
> look at the NLM implementation, the more potential issues I see with it.

Should we assume that, with the complete removal of Gluster/NFS already on the 
horizon, debugging and fixing NLM (even if only for the more common, reported 
crash cases) would be an extremely low-priority task? ;-)

Would it be possible for someone to simply check whether the crashes happen 
also on the 3.7.9-12 codebase used in latest RHGS 3.1.3?
My cluster is already at 3.7.12 feature level (and using it), so I suppose I 
could not easily downgrade.
Since Red Hat QA found a similar problem while testing the 3.8.4-10 codebase in 
RHGS 3.2.0, we could trace the problem back to post-3.7.9 developments, if RHGS 
3.1.3 is immune.

> If the workload does not require locking operations, you may be able to
> work around the problem by mounting with "-o nolock". Depending on the
> application, this can be safe or cause data corruption...

If I'm not mistaken, typical NFS uses such as YUM repositories and home 
directories would be barred (I think that SQLite needs locking and both 
createrepo and firefox use SQLite, right?).
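For completeness, the workaround Niels describes would look something like the sketch below (server, volume and mount point names are placeholders; Gluster/NFS serves NFSv3).

```shell
#!/usr/bin/env bash
# Sketch of mounting a Gluster/NFS export with NLM locking disabled.
# Only safe if nothing on the mount relies on file locks (SQLite-backed
# tools such as createrepo or Firefox do rely on them).

mount_nolock() {
  local server=$1 volume=$2 mnt=$3
  mount -t nfs -o vers=3,nolock "${server}:/${volume}" "$mnt"
}

# Usage: mount_nolock nas1 gv0 /mnt/gv0
```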

> An other alternative is to use NFS-Ganesha instead of Gluster/NFS.
> Ganesha is more mature than Gluster/NFS and is more actively developed.
> Gluster/NFS is being deprecated in favour of NFS-Ganesha.

Pure storage infrastructure uses should be migratable, I suppose, but extended 
testing and a suitable maintenance window (a rolling live migration from 
Gluster/NFS is not feasible, if I understood Ganesha right) would be needed 
anyway.

More "extreme" uses (such as mine, unfortunately: hyperconverged Gluster+oVirt 
setup coexisting with full CIFS/NFS sharing services) have not been 
documented/considered for Ganesha, according to my own research on the case 
(but please correct me if I'm wrong).
Since I already envisioned such an outcome, I recently posted a request for 
info/directions on such a migration in my particular case:

http://lists.gluster.org/pipermail/gluster-users/2017-January/029650.html

Can anyone from the developers camp kindly comment on the above points? :-)

Many many thanks in advance.

Best regards,
Giuseppe

> HTH,
> Niels
> 
> 
> > 
> > Here is the output from the logs one of the times it died:
> > 
> > glustershd.log:
> > 
> > [2017-01-10 19:06:20.265918] W [socket.c:588:__socket_rwv] 0-nfs: readv
> > on /var/run/gluster/a921bec34928e8380280358a30865cee.socket failed (No
> > data available)
> > [2017-01-10 19:06:20.265964] I [MSGID: 106006]
> > [glusterd-svc-mgmt.c:327:glusterd_svc_common_rpc_notify] 0-management:
> > nfs has disconnected from glusterd.
> > 
> > 
> > nfs.log:
> > 
> > [2017-01-10 19:06:20.135430] D [name.c:168:client_fill_address_family]
> > 

Re: [Gluster-users] [Gluster-devel] Lot of EIO errors in disperse volume

2017-01-13 Thread Ankireddypalle Reddy
Xavi,
 I enabled TRACE logging. The log grew to 120 GB and I could not 
make much sense of it. So I started logging the GFID in the code where we were 
seeing errors.

[2017-01-13 17:02:01.761349] I [dict.c:3065:dict_dump_to_log] 
0-glusterfsProd-disperse-0: dict=0x7fa6706bc690 
((trusted.ec.size:0:0:0:0:30:6b:0:0:)(trusted.ec.version:0:0:0:0:0:0:2a:38:0:0:0:0:0:0:2a:38:))
[2017-01-13 17:02:01.761360] I [dict.c:3065:dict_dump_to_log] 
0-glusterfsProd-disperse-0: dict=0x7fa6706bed64 
((trusted.ec.size:0:0:0:0:0:0:0:0:)(trusted.ec.version:0:0:0:0:0:0:0:0:0:0:0:0:0:0:2a:38:))
[2017-01-13 17:02:01.761365] W [MSGID: 122056] 
[ec-combine.c:881:ec_combine_check] 0-glusterfsProd-disperse-0: Mismatching 
xdata in answers of 'LOOKUP'
[2017-01-13 17:02:01.761405] I [dict.c:166:key_value_cmp] 
0-glusterfsProd-disperse-0: 'trusted.ec.size' is different in two dicts (8, 8)
[2017-01-13 17:02:01.761417] I [dict.c:3065:dict_dump_to_log] 
0-glusterfsProd-disperse-0: dict=0x7fa6706bbb14 
((trusted.ec.size:0:0:0:0:30:6b:0:0:)(trusted.ec.version:0:0:0:0:0:0:2a:38:0:0:0:0:0:0:2a:38:))
[2017-01-13 17:02:01.761428] I [dict.c:3065:dict_dump_to_log] 
0-glusterfsProd-disperse-0: dict=0x7fa6706bed64 
((trusted.ec.size:0:0:0:0:0:0:0:0:)(trusted.ec.version:0:0:0:0:0:0:0:0:0:0:0:0:0:0:2a:38:))
[2017-01-13 17:02:01.761433] W [MSGID: 122056] 
[ec-combine.c:881:ec_combine_check] 0-glusterfsProd-disperse-0: Mismatching 
xdata in answers of 'LOOKUP'
[2017-01-13 17:02:01.761442] W [MSGID: 122006] 
[ec-combine.c:214:ec_iatt_combine] 0-glusterfsProd-disperse-0: Failed to 
combine iatt (inode: 11275691004192850514-11275691004192850514, gfid: 
60b990ed-d741-4176-9c7b-4d3a25fb8252  -  60b990ed-d741-4176-9c7b-4d3a25fb8252,  
links: 1-1, uid: 0-0, gid: 0-0, rdev: 0-0,size: 406650880-406683648, mode: 
100775-100775)
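The disagreement in those two dict dumps is easier to see in decimal. A small helper (my own, for illustration) to decode the colon-separated hex bytes that dict_dump_to_log prints, treating them as a big-endian 64-bit value as trusted.ec.size uses:

```shell
#!/usr/bin/env bash
# Decode a colon-separated hex byte string from dict_dump_to_log
# into a decimal big-endian 64-bit value.

decode_ec_bytes() {
  v=0
  for b in $(echo "$1" | tr ':' ' '); do
    v=$(( (v << 8) + 0x$b ))
  done
  echo "$v"
}

# trusted.ec.size from the first answer vs. the second:
decode_ec_bytes "0:0:0:0:30:6b:0:0"   # prints 812318720
decode_ec_bytes "0:0:0:0:0:0:0:0"     # prints 0
```

So one brick answered a trusted.ec.size of 812318720 bytes while another answered 0, which is what triggers the "Mismatching xdata in answers of 'LOOKUP'" warning.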

The file for which we are seeing this error turns out to be having a GFID of 
60b990ed-d741-4176-9c7b-4d3a25fb8252  

Then I tried to find the file with this GFID. It pointed me to the 
following path. I was expecting a real file system path, based on the following 
tutorial:
https://gluster.readthedocs.io/en/latest/Troubleshooting/gfid-to-path/

getfattr -n trusted.glusterfs.pathinfo -e text 
/mnt/gfid/.gfid/60b990ed-d741-4176-9c7b-4d3a25fb8252
getfattr: Removing leading '/' from absolute path names
# file: mnt/gfid/.gfid/60b990ed-d741-4176-9c7b-4d3a25fb8252
trusted.glusterfs.pathinfo="( 
( 

 
))"
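For reference, the tutorial's approach can be wrapped up roughly as follows (server/volume names are placeholders; the virtual .gfid/ namespace requires the aux-gfid-mount option):

```shell
#!/usr/bin/env bash
# Sketch of resolving a GFID to brick paths via a gfid-aware mount,
# per the gfid-to-path troubleshooting guide. Server and volume
# names are placeholders.

gfid_to_path() {
  local server=$1 volume=$2 gfid=$3 mnt
  mnt=$(mktemp -d) || return 1
  # aux-gfid-mount exposes the virtual .gfid/ namespace on the mount.
  mount -t glusterfs -o aux-gfid-mount "${server}:/${volume}" "$mnt" || return 1
  # pathinfo reports the backend (brick) locations of the file.
  getfattr -n trusted.glusterfs.pathinfo -e text "$mnt/.gfid/$gfid"
  umount "$mnt" && rmdir "$mnt"
}

# Usage: gfid_to_path glusterfs4 glusterfsProd 60b990ed-d741-4176-9c7b-4d3a25fb8252
```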

Then I looked for the xatttrs for these files from all the 3 bricks

[root@glusterfs4 glusterfs]# getfattr -d -m . -e hex 
/ws/disk1/ws_brick/.glusterfs/60/b9/60b990ed-d741-4176-9c7b-4d3a25fb8252
getfattr: Removing leading '/' from absolute path names
# file: ws/disk1/ws_brick/.glusterfs/60/b9/60b990ed-d741-4176-9c7b-4d3a25fb8252
trusted.bit-rot.version=0x02005877a8dc00041138
trusted.ec.config=0x080301000200
trusted.ec.size=0x
trusted.ec.version=0x2a38
trusted.gfid=0x60b990edd74141769c7b4d3a25fb8252

[root@glusterfs5 bricks]# getfattr -d -m . -e hex 
/ws/disk1/ws_brick/.glusterfs/60/b9/60b990ed-d741-4176-9c7b-4d3a25fb8252
getfattr: Removing leading '/' from absolute path names
# file: ws/disk1/ws_brick/.glusterfs/60/b9/60b990ed-d741-4176-9c7b-4d3a25fb8252
trusted.bit-rot.version=0x02005877a8dc000c92d0
trusted.ec.config=0x080301000200
trusted.ec.dirty=0x0016
trusted.ec.size=0x306b
trusted.ec.version=0x2a382a38
trusted.gfid=0x60b990edd74141769c7b4d3a25fb8252

[root@glusterfs6 ee]# getfattr -d -m . -e hex 
/ws/disk1/ws_brick/.glusterfs/60/b9/60b990ed-d741-4176-9c7b-4d3a25fb8252
getfattr: Removing leading '/' from absolute path names
# file: ws/disk1/ws_brick/.glusterfs/60/b9/60b990ed-d741-4176-9c7b-4d3a25fb8252
trusted.bit-rot.version=0x02005877a8dc000c9436
trusted.ec.config=0x080301000200
trusted.ec.dirty=0x0016
trusted.ec.size=0x306b
trusted.ec.version=0x2a382a38
trusted.gfid=0x60b990edd74141769c7b4d3a25fb8252

It turns out that the size and version in fact do not match for one of the 
files.
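A quick way to spot such disagreements is to pull the EC xattrs for the same gfid from every brick and eyeball the diff. The hostnames and brick path below are taken from the outputs above; passwordless ssh from the admin host is an assumption.

```shell
#!/usr/bin/env bash
# Sketch: compare erasure-coding xattrs of one gfid across all bricks.
# Hostnames and brick path match the getfattr outputs above; root ssh
# access to each brick server is assumed.

compare_ec_xattrs() {
  local gfid=$1 path host
  for host in glusterfs4 glusterfs5 glusterfs6; do
    # .glusterfs hard-link path is derived from the gfid's first bytes.
    path="/ws/disk1/ws_brick/.glusterfs/${gfid:0:2}/${gfid:2:2}/$gfid"
    echo "== $host =="
    ssh "$host" "getfattr -d -m 'trusted.ec.*' -e hex '$path'" 2>/dev/null
  done
}

# Usage: compare_ec_xattrs 60b990ed-d741-4176-9c7b-4d3a25fb8252
```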

Thanks and Regards,
Ram

-Original Message-
From: gluster-devel-boun...@gluster.org 
[mailto:gluster-devel-boun...@gluster.org] On Behalf Of Ankireddypalle Reddy
Sent: Friday, January 13, 2017 4:17 AM
To: Xavier Hernandez
Cc: gluster-users@gluster.org; Gluster Devel (gluster-de...@gluster.org)
Subject: Re: [Gluster-devel] [Gluster-users] Lot of EIO errors in disperse 
volume

Xavi,
Thanks for explanation. Will collect TRACE logs today. 

Thanks and Regards,
Ram

Sent from my iPhone

> On Jan 13, 2017, at 3:03 AM, Xavier Hernandez  wrote:
> 
> Hi Ram,
> 
>> On 

Re: [Gluster-users] WRITE => -1 (Input/output error)

2017-01-13 Thread Stephen Martin
Thanks thats excellent. I will give it a go over weekend.

Much appreciated.

> On 13 Jan 2017, at 16:04, Ben Werthmann  wrote:
> 
> I debug fuse mounts by running glusterfs binary directly with the below 
> command. I find it works best to incorporate this command into a test script.
> 
> 'glusterfs --volfile-server=$onevolumeserver --log-file=$logfile 
> --log-level=$level --volfile-id=$volname $mountpoint`
> 
> I like to use '--no-daemon', or the '--debug'[1]. For $level, start with 
> DEBUG, then try TRACE. TRACE is _very_ verbose.
> 
> For more info, see: 'glusterfs --help' or 'man glusterfs'.
> 
> The mount.glusterfs handler may be reviewed via:  'less -S $(which 
> mount.glusterfs)'
> 
> [1] Run in debug mode.  This option sets --no-daemon, --log-level to DEBUG, 
> and --log-file to console.
> 
> On Fri, Jan 13, 2017 at 7:25 AM, Stephen Martin  > wrote:
> None that I am aware of, would I be looking in Gluster logs to see EIO?
> 
> I had a look at the source code for fuse_writev_cbk to see if I could give me 
> some hints as to where the issue is however no joy.
> 
> Would there be somewhere that would tell me what operation is failing what 
> file etc? Maybe I can turn up logging level.
> 
> > On 13 Jan 2017, at 12:14, Mohammed Rafi K C  > > wrote:
> >
> > the writes might have failed with EIO. Was there any network failure
> > back and forth ?
> >
> >
> > RafI KC
> >
> >
> > On 01/12/2017 07:50 PM, Stephen Martin wrote:
> >> Hi looking for help with an error I’m getting.
> >>
> >> I’m new to Gluster but have gone though most of the getting started and 
> >> overview documentation.
> >>
> >> I have been given a production server that uses Gluster and although its 
> >> up and running and fuctioning I see lots of errors in the logs
> >>
> >>[2017-01-12 13:19:37.326562] W [fuse-bridge.c:2167:fuse_writev_cbk] 
> >> 0-glusterfs-fuse: 5852034: WRITE => -1 (Input/output error)
> >>
> >> I don't really know what action caused it, I get lots of these logged how 
> >> can I start a diagnostic?
> >>
> >> Thanks
> >>
> >> .
> >> ___
> >> Gluster-users mailing list
> >> Gluster-users@gluster.org 
> >> http://www.gluster.org/mailman/listinfo/gluster-users 
> >> 
> >
> 
> ___
> Gluster-users mailing list
> Gluster-users@gluster.org 
> http://www.gluster.org/mailman/listinfo/gluster-users 
> 

___
Gluster-users mailing list
Gluster-users@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-users

Re: [Gluster-users] WRITE => -1 (Input/output error)

2017-01-13 Thread Ben Werthmann
Oops, ignore the duplicate '--log-level=$level'.

On Fri, Jan 13, 2017 at 11:04 AM, Ben Werthmann  wrote:

> I debug fuse mounts by running glusterfs binary directly with the below
> command. I find it works best to incorporate this command into a test
> script.
>
> 'glusterfs --volfile-server=$onevolumeserver --log-file=$logfile
> --log-level=$level --volfile-id=$volname $mountpoint`
>
> I like to use '--no-daemon', or the '--debug'[1]. For $level, start
> with DEBUG, then try TRACE. TRACE is _very_ verbose.
>
> For more info, see: 'glusterfs --help' or 'man glusterfs'.
>
> The mount.glusterfs handler may be reviewed via:  'less -S $(which
> mount.glusterfs)'
>
> [1] Run in debug mode.  This option sets --no-daemon, --log-level to
> DEBUG, and --log-file to console.
>
> On Fri, Jan 13, 2017 at 7:25 AM, Stephen Martin  wrote:
>
>> None that I am aware of, would I be looking in Gluster logs to see EIO?
>>
>> I had a look at the source code for fuse_writev_cbk to see if I could
>> give me some hints as to where the issue is however no joy.
>>
>> Would there be somewhere that would tell me what operation is failing
>> what file etc? Maybe I can turn up logging level.
>>
>> > On 13 Jan 2017, at 12:14, Mohammed Rafi K C 
>> wrote:
>> >
>> > the writes might have failed with EIO. Was there any network failure
>> > back and forth ?
>> >
>> >
>> > RafI KC
>> >
>> >
>> > On 01/12/2017 07:50 PM, Stephen Martin wrote:
>> >> Hi looking for help with an error I’m getting.
>> >>
>> >> I’m new to Gluster but have gone though most of the getting started
>> and overview documentation.
>> >>
>> >> I have been given a production server that uses Gluster and although
>> its up and running and fuctioning I see lots of errors in the logs
>> >>
>> >>[2017-01-12 13:19:37.326562] W [fuse-bridge.c:2167:fuse_writev_cbk]
>> 0-glusterfs-fuse: 5852034: WRITE => -1 (Input/output error)
>> >>
>> >> I don't really know what action caused it, I get lots of these logged
>> how can I start a diagnostic?
>> >>
>> >> Thanks
>> >>
>> >> .
>> >> ___
>> >> Gluster-users mailing list
>> >> Gluster-users@gluster.org
>> >> http://www.gluster.org/mailman/listinfo/gluster-users
>> >
>>
>> ___
>> Gluster-users mailing list
>> Gluster-users@gluster.org
>> http://www.gluster.org/mailman/listinfo/gluster-users
>>
>
>
___
Gluster-users mailing list
Gluster-users@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-users

Re: [Gluster-users] WRITE => -1 (Input/output error)

2017-01-13 Thread Ben Werthmann
I debug fuse mounts by running glusterfs binary directly with the below
command. I find it works best to incorporate this command into a test
script.

'glusterfs --volfile-server=$onevolumeserver --log-file=$logfile
--log-level=$level --volfile-id=$volname $mountpoint'

I like to use '--no-daemon' or '--debug' [1]. For $level, start
with DEBUG, then try TRACE. TRACE is _very_ verbose.

For more info, see: 'glusterfs --help' or 'man glusterfs'.

The mount.glusterfs handler may be reviewed via:  'less -S $(which
mount.glusterfs)'

[1] Run in debug mode.  This option sets --no-daemon, --log-level to DEBUG,
and --log-file to console.
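Wrapped into the kind of test script Ben suggests, the invocation might look like the sketch below (server, volume, mount point and log path are arguments/placeholders; --no-daemon keeps the process in the foreground so logs are easy to watch):

```shell
#!/usr/bin/env bash
# Sketch of a reusable debug-mount helper around the glusterfs binary,
# based on the invocation above. TRACE is extremely verbose; start
# with DEBUG.

debug_mount() {
  local server=$1 volume=$2 mountpoint=$3 level=${4:-DEBUG}
  glusterfs --no-daemon \
            --volfile-server="$server" \
            --volfile-id="$volume" \
            --log-level="$level" \
            --log-file="/var/log/glusterfs/debug-$volume.log" \
            "$mountpoint"
}

# Usage: debug_mount gfs1 gv0 /mnt/gv0 TRACE
```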

On Fri, Jan 13, 2017 at 7:25 AM, Stephen Martin  wrote:

> None that I am aware of, would I be looking in Gluster logs to see EIO?
>
> I had a look at the source code for fuse_writev_cbk to see if I could give
> me some hints as to where the issue is however no joy.
>
> Would there be somewhere that would tell me what operation is failing what
> file etc? Maybe I can turn up logging level.
>
> > On 13 Jan 2017, at 12:14, Mohammed Rafi K C  wrote:
> >
> > the writes might have failed with EIO. Was there any network failure
> > back and forth ?
> >
> >
> > RafI KC
> >
> >
> > On 01/12/2017 07:50 PM, Stephen Martin wrote:
> >> Hi looking for help with an error I’m getting.
> >>
> >> I’m new to Gluster but have gone though most of the getting started and
> overview documentation.
> >>
> >> I have been given a production server that uses Gluster and although
> its up and running and fuctioning I see lots of errors in the logs
> >>
> >>[2017-01-12 13:19:37.326562] W [fuse-bridge.c:2167:fuse_writev_cbk]
> 0-glusterfs-fuse: 5852034: WRITE => -1 (Input/output error)
> >>
> >> I don't really know what action caused it, I get lots of these logged
> how can I start a diagnostic?
> >>
> >> Thanks
> >>
> >> .
> >> ___
> >> Gluster-users mailing list
> >> Gluster-users@gluster.org
> >> http://www.gluster.org/mailman/listinfo/gluster-users
> >
>
> ___
> Gluster-users mailing list
> Gluster-users@gluster.org
> http://www.gluster.org/mailman/listinfo/gluster-users
>
___
Gluster-users mailing list
Gluster-users@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-users

Re: [Gluster-users] CentOS Storage SIG Repo for 3.8 is behind 3 versions (3.8.5). Is it still being used and can I help?

2017-01-13 Thread Kaushal M
Packages for 3.7, 3.8 and 3.9 are being built for the Storage SIG.
Niels is very punctual about building them. The packages first land in
the respective testing repositories. If someone verifies that the
packages are okay, and gives Niels a heads-up, he pushes the packages
to be signed and added to the release repositories.

The only issue is that Niels doesn't get enough (or any)
verifications. And the packages linger in testing.

On Fri, Jan 13, 2017 at 3:31 PM, Pavel Szalbot  wrote:
> Hi, you can install 3.8.7 from centos-gluster38-test using:
>
> yum --enablerepo=centos-gluster38-test install glusterfs
>
> I am not sure how QA works for CentOS Storage SIG, but 3.8.7 works same as
> 3.8.5 for me - libvirt gfapi is unfortunately broken, no other problems
> detected.
>
> Btw 3.9 is short term maintenance release
> (https://lists.centos.org/pipermail/centos-devel/2016-September/015197.html).
>
>
> -ps
>
> On Fri, Jan 13, 2017 at 1:18 AM, Daryl lee  wrote:
>>
>> Hey Gluster Community,
>>
>> According to the community packages list I get the impression that 3.8
>> would be released to the CentOS Storage SIG Repo, but this seems to have
>> stopped with 3.8.5 and 3.9 is still missing all together.   However, 3.7 is
>> still being updated and is at 3.7.8 so I am confused why the other two
>> versions have stopped.
>>
>>
>>
>> I did some looking on the past posts to this list and found a conversation
>> about 3.9 on the CentOS repo last year but it looks like it's still not up
>> yet; possibly due to a lack of community involvement in the testing and
>> reporting back to whoever the maintainer is (which we don’t know yet)?   I
>> might be in a position to help since I have a test environment that mirrors
>> my production environment setup that I would use for testing the patch
>> anyways, I might as well provide some good to the community..   At this
>> point I know to do " yum install --enablerepo=centos-gluster38-test
>> glusterfs-server" but I'm not sure who to tell if it works or not, and what
>> kind of info they are looking for.If someone wanted to give me a little
>> guidance that would be awesome, especially if it will save me from having to
>> switch to manually downloading packages.
>>
>>
>>
>> I guess the basic question is do we expect releases to resume for 3.8 on
>> the CentOS Storage SIG repo or should I be looking to move to manual
>> patching for 3.8.  Additionally, if the person who does the releases to the
>> CentOS Storage SIG is waiting for someone to tell them it looks fine,  who
>> should I contact to do so?
>>
>>
>>
>>
>>
>>
>>
>> Thanks!
>>
>>
>>
>> Daryl
>>
>>
>> ___
>> Gluster-users mailing list
>> Gluster-users@gluster.org
>> http://www.gluster.org/mailman/listinfo/gluster-users
>
>
>
> ___
> Gluster-users mailing list
> Gluster-users@gluster.org
> http://www.gluster.org/mailman/listinfo/gluster-users
___
Gluster-users mailing list
Gluster-users@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-users

Re: [Gluster-users] WRITE => -1 (Input/output error)

2017-01-13 Thread Stephen Martin
None that I am aware of, would I be looking in Gluster logs to see EIO? 

I had a look at the source code for fuse_writev_cbk to see if I could give me 
some hints as to where the issue is however no joy.

Would there be somewhere that would tell me what operation is failing what file 
etc? Maybe I can turn up logging level.

> On 13 Jan 2017, at 12:14, Mohammed Rafi K C  wrote:
> 
> the writes might have failed with EIO. Was there any network failure
> back and forth ?
> 
> 
> RafI KC
> 
> 
> On 01/12/2017 07:50 PM, Stephen Martin wrote:
>> Hi looking for help with an error I’m getting.
>> 
>> I’m new to Gluster but have gone through most of the getting started and 
>> overview documentation. 
>> 
>> I have been given a production server that uses Gluster and although it's up 
>> and running and functioning I see lots of errors in the logs
>> 
>>[2017-01-12 13:19:37.326562] W [fuse-bridge.c:2167:fuse_writev_cbk] 
>> 0-glusterfs-fuse: 5852034: WRITE => -1 (Input/output error) 
>> 
>> I don't really know what action caused it, I get lots of these logged how 
>> can I start a diagnostic?
>> 
>> Thanks
>> 
>> .
>> ___
>> Gluster-users mailing list
>> Gluster-users@gluster.org
>> http://www.gluster.org/mailman/listinfo/gluster-users
> 

___
Gluster-users mailing list
Gluster-users@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-users

Re: [Gluster-users] WRITE => -1 (Input/output error)

2017-01-13 Thread Mohammed Rafi K C
the writes might have failed with EIO. Was there any network failure
back and forth ?


RafI KC


On 01/12/2017 07:50 PM, Stephen Martin wrote:
> Hi looking for help with an error I’m getting.
>
> I’m new to Gluster but have gone through most of the getting started and 
> overview documentation. 
>
> I have been given a production server that uses Gluster and although it's up 
> and running and functioning I see lots of errors in the logs
>
> [2017-01-12 13:19:37.326562] W [fuse-bridge.c:2167:fuse_writev_cbk] 
> 0-glusterfs-fuse: 5852034: WRITE => -1 (Input/output error) 
>
> I don't really know what action caused it, I get lots of these logged how can 
> I start a diagnostic?
>
> Thanks
>
> .
> ___
> Gluster-users mailing list
> Gluster-users@gluster.org
> http://www.gluster.org/mailman/listinfo/gluster-users

___
Gluster-users mailing list
Gluster-users@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-users


Re: [Gluster-users] NFS service dying

2017-01-13 Thread Niels de Vos
On Wed, Jan 11, 2017 at 11:58:29AM -0700, Paul Allen wrote:
> I'm running into an issue where the gluster nfs service keeps dying on a
> new cluster I have setup recently. We've been using Gluster on several
> other clusters now for about a year or so and I have never seen this
> issue before, nor have I been able to find anything remotely similar to
> it while searching on-line. I initially was using the latest version in
> the Gluster Debian repository for Jessie, 3.9.0-1, and then I tried
> using the next one down, 3.8.7-1. Both behave the same for me.
> 
> What I was seeing was after a while the nfs service on the NAS server
> would suddenly die after a number of processes had run on the app server
> I had connected to the new NAS servers for testing (we're upgrading the
> NAS servers for this cluster to newer hardware and expanded storage, the
> current production NAS servers are using nfs-kernel-server with no type
> of clustering of the data). I checked the logs but all it showed me was
> something that looked like a stack trace in the nfs.log and the
> glustershd.log showed the nfs service disconnecting. I turned on
> debugging but it didn't give me a whole lot more, and certainly nothing
> that helps me identify the source of my issue. It is pretty consistent
> in dying shortly after I mount the file system on the servers and start
> testing, usually within 15-30 minutes. But if I have nothing using the
> file system, mounted or no, the service stays running for days. I tried
> mounting it using the gluster client, and it works fine, but I can't use
> that due to the performance penalty, it slows the websites down by a few
> seconds at a minimum.

This seems to be related to the NLM protocol that Gluster/NFS provides.
Earlier this week one of our Red Hat quality engineers also reported
this (or a very similar) problem.

https://bugzilla.redhat.com/show_bug.cgi?id=1411344

At the moment I suspect that this is related to re-connects of some
kind, but I have not been able to identify the cause sufficiently to be
sure. This definitely is a coding problem in Gluster/NFS, but the more I
look at the NLM implementation, the more potential issues I see with it.

If the workload does not require locking operations, you may be able to
work around the problem by mounting with "-o nolock". Depending on the
application, this can be safe or cause data corruption...

An other alternative is to use NFS-Ganesha instead of Gluster/NFS.
Ganesha is more mature than Gluster/NFS and is more actively developed.
Gluster/NFS is being deprecated in favour of NFS-Ganesha.

HTH,
Niels


> 
> Here is the output from the logs one of the times it died:
> 
> glustershd.log:
> 
> [2017-01-10 19:06:20.265918] W [socket.c:588:__socket_rwv] 0-nfs: readv
> on /var/run/gluster/a921bec34928e8380280358a30865cee.socket failed (No
> data available)
> [2017-01-10 19:06:20.265964] I [MSGID: 106006]
> [glusterd-svc-mgmt.c:327:glusterd_svc_common_rpc_notify] 0-management:
> nfs has disconnected from glusterd.
> 
> 
> nfs.log:
> 
> [2017-01-10 19:06:20.135430] D [name.c:168:client_fill_address_family]
> 0-NLM-client: address-family not specified, marking it as unspec for
> getaddrinfo to resolve from (remote-host: 10.20.5.13)
> [2017-01-10 19:06:20.135531] D [MSGID: 0]
> [common-utils.c:335:gf_resolve_ip6] 0-resolver: returning ip-10.20.5.13
> (port-48963) for hostname: 10.20.5.13 and port: 48963
> [2017-01-10 19:06:20.136569] D [logging.c:1764:gf_log_flush_extra_msgs]
> 0-logging-infra: Log buffer size reduced. About to flush 5 extra log
> messages
> [2017-01-10 19:06:20.136630] D [logging.c:1767:gf_log_flush_extra_msgs]
> 0-logging-infra: Just flushed 5 extra log messages
> pending frames:
> frame : type(0) op(0)
> patchset: git://git.gluster.com/glusterfs.git
> signal received: 11
> time of crash:
> 2017-01-10 19:06:20
> configuration details:
> argp 1
> backtrace 1
> dlfcn 1
> libpthread 1
> llistxattr 1
> setfsid 1
> spinlock 1
> epoll.h 1
> xattr.h 1
> st_atim.tv_nsec 1
> package-string: glusterfs 3.9.0
> /usr/lib/x86_64-linux-gnu/libglusterfs.so.0(_gf_msg_backtrace_nomem+0xac)[0x7f891f0846ac]
> /usr/lib/x86_64-linux-gnu/libglusterfs.so.0(gf_print_trace+0x324)[0x7f891f08dcc4]
> /lib/x86_64-linux-gnu/libc.so.6(+0x350e0)[0x7f891db870e0]
> /lib/x86_64-linux-gnu/libc.so.6(+0x91d8a)[0x7f891dbe3d8a]
> /usr/lib/x86_64-linux-gnu/glusterfs/3.9.0/xlator/nfs/server.so(+0x3a352)[0x7f8918682352]
> /usr/lib/x86_64-linux-gnu/glusterfs/3.9.0/xlator/nfs/server.so(+0x3cc15)[0x7f8918684c15]
> /usr/lib/x86_64-linux-gnu/libgfrpc.so.0(rpc_clnt_notify+0x2aa)[0x7f891ee4e4da]
> /usr/lib/x86_64-linux-gnu/libgfrpc.so.0(rpc_transport_notify+0x23)[0x7f891ee4a7e3]
> /usr/lib/x86_64-linux-gnu/glusterfs/3.9.0/rpc-transport/socket.so(+0x4b33)[0x7f8919eadb33]
> /usr/lib/x86_64-linux-gnu/glusterfs/3.9.0/rpc-transport/socket.so(+0x8f07)[0x7f8919eb1f07]
> /usr/lib/x86_64-linux-gnu/libglusterfs.so.0(+0x7e836)[0x7f891f0d9836]
> 

Re: [Gluster-users] CentOS Storage SIG Repo for 3.8 is behind 3 versions (3.8.5). Is it still being used and can I help?

2017-01-13 Thread Pavel Szalbot
Hi, you can install 3.8.7 from centos-gluster38-test using:

yum --enablerepo=centos-gluster38-test install glusterfs

I am not sure how QA works for the CentOS Storage SIG, but 3.8.7 works the same
as 3.8.5 for me - libvirt gfapi is unfortunately broken, no other problems
detected.
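
If it helps, the builds available in the test repo can be listed before installing. This is a sketch; the repo name comes from the command below, and the package names are the ones reported elsewhere in this thread.

```shell
# Show every glusterfs build the test repo offers, not just the newest.
yum --enablerepo=centos-gluster38-test --showduplicates list glusterfs glusterfs-server
```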

Btw, 3.9 is a short-term maintenance release (
https://lists.centos.org/pipermail/centos-devel/2016-September/015197.html).


-ps

On Fri, Jan 13, 2017 at 1:18 AM, Daryl lee  wrote:

> Hey Gluster Community,
>
> According to the community packages list I get the impression that 3.8
> would be released to the CentOS Storage SIG repo, but this seems to have
> stopped with 3.8.5, and 3.9 is still missing altogether.   However, 3.7 is
> still being updated and is at 3.7.8, so I am confused about why the other two
> versions have stopped.
>
>
>
> I did some looking through past posts to this list and found a conversation
> about 3.9 on the CentOS repo last year, but it looks like it's still not up
> yet; possibly due to a lack of community involvement in testing and
> reporting back to whoever the maintainer is (which we don't know yet)?   I
> might be in a position to help, since I have a test environment that mirrors
> my production setup and that I would use for testing the patch
> anyway, so I might as well do some good for the community.   At this
> point I know to do "yum install --enablerepo=centos-gluster38-test
> glusterfs-server", but I'm not sure who to tell whether it works or not, and
> what kind of info they are looking for.    If someone wanted to give me a
> little guidance that would be awesome, especially if it will save me from
> having to switch to manually downloading packages.
>
>
>
> I guess the basic question is: do we expect releases to resume for 3.8 on
> the CentOS Storage SIG repo, or should I be looking to move to manual
> patching for 3.8?  Additionally, if the person who does the releases to the
> CentOS Storage SIG is waiting for someone to tell them it looks fine, who
> should I contact to do so?
>
>
>
>
>
>
>
> Thanks!
>
>
>
> Daryl
>
> ___
> Gluster-users mailing list
> Gluster-users@gluster.org
> http://www.gluster.org/mailman/listinfo/gluster-users
>

Re: [Gluster-users] [Gluster-devel] Lot of EIO errors in disperse volume

2017-01-13 Thread Ankireddypalle Reddy
Xavi,
Thanks for the explanation. Will collect TRACE logs today. 

Thanks and Regards,
Ram

Sent from my iPhone

> On Jan 13, 2017, at 3:03 AM, Xavier Hernandez  wrote:
> 
> Hi Ram,
> 
>> On 12/01/17 22:14, Ankireddypalle Reddy wrote:
>> Xavi,
>> I changed the logging to log the individual bytes. Consider the 
>> following from ws-glus.log file where /ws/glus is the mount point.
>> 
>> [2017-01-12 20:47:59.368102] I [MSGID: 109063] 
>> [dht-layout.c:718:dht_layout_normalize] 0-glusterfsProd-dht: Found anomalies 
>> in /Folder_01.05.2017_21.15/CV_MAGNETIC/V_30970/CHUNK_390607 (gfid = 
>> e694387f-dde7-410b-9562-914a994d5e85). Holes=1 overlaps=0
>> [2017-01-12 20:47:59.391218] I [MSGID: 109036] 
>> [dht-common.c:9082:dht_log_new_layout_for_dir_selfheal] 0-glusterfsProd-dht: 
>> Setting layout of /Folder_01.05.2017_21.15/CV_MAGNETIC/V_30970/CHUNK_390607 
>> with [Subvol_name: glusterfsProd-disperse-0, Err: -1 , Start: 2505397587 , 
>> Stop: 2863311527 , Hash: 1 ], [Subvol_name: glusterfsProd-disperse-1, Err: 
>> -1 , Start: 2863311528 , Stop: 3221225468 , Hash: 1 ], [Subvol_name: 
>> glusterfsProd-disperse-10, Err: -1 , Start: 3221225469 , Stop: 3579139409 , 
>> Hash: 1 ], [Subvol_name: glusterfsProd-disperse-11, Err: -1 , Start: 
>> 3579139410 , Stop: 3937053350 , Hash: 1 ], [Subvol_name: 
>> glusterfsProd-disperse-2, Err: -1 , Start: 3937053351 , Stop: 4294967295 , 
>> Hash: 1 ], [Subvol_name: glusterfsProd-disperse-3, Err: -1 , Start: 0 , 
>> Stop: 357913940 , Hash: 1 ], [Subvol_name: glusterfsProd-disperse-4, Err: -1 
>> , Start: 357913941 , Stop: 715827881 , Hash: 1 ], [Subvol_name: 
>> glusterfsProd-disperse-5, Err: -1 , Start: 715827882 , Stop: 1073741822 , 
>> Hash
 : 1 ], [Subvol_name: glusterfsProd-disperse-6, Err: -1 , Start: 1073741823 , 
Stop: 1431655763 , Hash: 1 ], [Subvol_name: glusterfsProd-disperse-7, Err: -1 , 
Start: 1431655764 , Stop: 1789569704 , Hash: 1 ], [Subvol_name: 
glusterfsProd-disperse-8, Err: -1 , Start: 1789569705 , Stop: 2147483645 , 
Hash: 1 ], [Subvol_name: glusterfsProd-disperse-9, Err: -1 , Start: 2147483646 
, Stop: 2505397586 , Hash: 1 ],
>> 
>> Self-heal seems to be triggered for path 
>> /Folder_01.05.2017_21.15/CV_MAGNETIC/V_30970/CHUNK_390607 due to anomalies 
>> as per DHT.  It would be great if someone could explain what the anomaly 
>> could be here.  The setup where we encountered this is a fairly stable 
>> setup with no brick or node failures.
> 
> This is not really a self-heal, at least from the point of view of ec. This 
> means that DHT has found a discrepancy in the layout of that directory; 
> however, this doesn't indicate any problem (notice the 'I' in the log, meaning 
> that it's informative, not a warning or an error).
> 
> Not sure how DHT works in this case or why it finds this "anomaly", but if 
> there aren't any previous errors before that message, it can be completely 
> ignored.
> 
> Not sure if it can be related to the cluster.weighted-rebalance option, which 
> is enabled by default.
> 
>> 
>>   Then Self-heal seems to have encountered the following error.
>> 
>> [2017-01-12 20:48:23.418432] I [dict.c:166:key_value_cmp] 
>> 0-glusterfsProd-disperse-2: 'trusted.ec.version' is different in two dicts 
>> (16, 16)
>> [2017-01-12 20:48:23.418496] I [dict.c:3065:dict_dump_to_log] 
>> 0-glusterfsProd-disperse-2: dict=0x7f0b649520ac 
>> ((trusted.glusterfs.dht:0:0:0:1:0:0:0:0:0:0:0:0:15:55:55:54:)(trusted.ec.version:0:0:0:0:0:0:0:b:0:0:0:0:0:0:0:e:))
>> [2017-01-12 20:48:23.418519] I [dict.c:3065:dict_dump_to_log] 
>> 0-glusterfsProd-disperse-2: dict=0x7f0b6495b4e0 
>> ((trusted.glusterfs.dht:0:0:0:1:0:0:0:0:0:0:0:0:15:55:55:54:)(trusted.ec.version:0:0:0:0:0:0:0:d:0:0:0:0:0:0:0:e:))
>> [2017-01-12 20:48:23.418531] W [MSGID: 122056] 
>> [ec-combine.c:873:ec_combine_check] 0-glusterfsProd-disperse-2: Mismatching 
>> xdata in answers of 'LOOKUP'
> 
> That's a real problem. Here we have two bricks that differ in the 
> trusted.ec.version xattr. However, this xattr does not necessarily belong to 
> the previous directory. They are unrelated messages.
> 
>> 
>> In this case glusterfsProd-disperse-2 sub volume actually 
>> consists of the following bricks.
>> glusterfs4sds:/ws/disk11/ws_brick, glusterfs5sds: 
>> /ws/disk11/ws_brick, glusterfs6sds: /ws/disk11/ws_brick
>> 
>> I went ahead and checked the value of trusted.ec.version on all 
>> the 3 bricks inside this sub vol:
>> 
>> [root@glusterfs6 ~]# getfattr -e hex -n trusted.ec.version 
>> /ws/disk11/ws_brick//Folder_01.05.2017_21.15/CV_MAGNETIC/V_30970/CHUNK_390607
>> # file: 
>> ws/disk11/ws_brick//Folder_01.05.2017_21.15/CV_MAGNETIC/V_30970/CHUNK_390607
>> trusted.ec.version=0x0009000b
>> 
>> [root@glusterfs4 ~]# getfattr -e hex -n trusted.ec.version 
>> /ws/disk11/ws_brick//Folder_01.05.2017_21.15/CV_MAGNETIC/V_30970/CHUNK_390607
>>

Re: [Gluster-users] [Gluster-devel] Lot of EIO errors in disperse volume

2017-01-13 Thread Xavier Hernandez

Hi Ram,

On 12/01/17 22:14, Ankireddypalle Reddy wrote:

Xavi,
 I changed the logging to log the individual bytes. Consider the 
following from ws-glus.log file where /ws/glus is the mount point.

[2017-01-12 20:47:59.368102] I [MSGID: 109063] 
[dht-layout.c:718:dht_layout_normalize] 0-glusterfsProd-dht: Found anomalies in 
/Folder_01.05.2017_21.15/CV_MAGNETIC/V_30970/CHUNK_390607 (gfid = 
e694387f-dde7-410b-9562-914a994d5e85). Holes=1 overlaps=0
[2017-01-12 20:47:59.391218] I [MSGID: 109036] 
[dht-common.c:9082:dht_log_new_layout_for_dir_selfheal] 0-glusterfsProd-dht: 
Setting layout of /Folder_01.05.2017_21.15/CV_MAGNETIC/V_30970/CHUNK_390607 
with [Subvol_name: glusterfsProd-disperse-0, Err: -1 , Start: 2505397587 , 
Stop: 2863311527 , Hash: 1 ], [Subvol_name: glusterfsProd-disperse-1, Err: -1 , 
Start: 2863311528 , Stop: 3221225468 , Hash: 1 ], [Subvol_name: 
glusterfsProd-disperse-10, Err: -1 , Start: 3221225469 , Stop: 3579139409 , 
Hash: 1 ], [Subvol_name: glusterfsProd-disperse-11, Err: -1 , Start: 3579139410 
, Stop: 3937053350 , Hash: 1 ], [Subvol_name: glusterfsProd-disperse-2, Err: -1 
, Start: 3937053351 , Stop: 4294967295 , Hash: 1 ], [Subvol_name: 
glusterfsProd-disperse-3, Err: -1 , Start: 0 , Stop: 357913940 , Hash: 1 ], 
[Subvol_name: glusterfsProd-disperse-4, Err: -1 , Start: 357913941 , Stop: 
715827881 , Hash: 1 ], [Subvol_name: glusterfsProd-disperse-5, Err: -1 , Start: 
715827882 , Stop: 1073741822 , Hash: 1 ], [Subvol_name: glusterfsProd-disperse-6, Err: -1 , Start: 1073741823 , 
Stop: 1431655763 , Hash: 1 ], [Subvol_name: glusterfsProd-disperse-7, Err: -1 , 
Start: 1431655764 , Stop: 1789569704 , Hash: 1 ], [Subvol_name: 
glusterfsProd-disperse-8, Err: -1 , Start: 1789569705 , Stop: 2147483645 , 
Hash: 1 ], [Subvol_name: glusterfsProd-disperse-9, Err: -1 , Start: 2147483646 
, Stop: 2505397586 , Hash: 1 ],


Self-heal seems to be triggered for path 
/Folder_01.05.2017_21.15/CV_MAGNETIC/V_30970/CHUNK_390607 due to anomalies as 
per DHT.  It would be great if someone could explain what the anomaly could be 
here.  The setup where we encountered this is a fairly stable setup with no 
brick or node failures.


This is not really a self-heal, at least from the point of view of ec. 
This means that DHT has found a discrepancy in the layout of that 
directory; however, this doesn't indicate any problem (notice the 'I' in 
the log, meaning it's informative, not a warning or an error).


Not sure how DHT works in this case or why it finds this "anomaly", but 
if there aren't any previous errors before that message, it can be 
completely ignored.


Not sure if it can be related to the cluster.weighted-rebalance option, 
which is enabled by default.




   Then Self-heal seems to have encountered the following error.

[2017-01-12 20:48:23.418432] I [dict.c:166:key_value_cmp] 
0-glusterfsProd-disperse-2: 'trusted.ec.version' is different in two dicts (16, 
16)
[2017-01-12 20:48:23.418496] I [dict.c:3065:dict_dump_to_log] 
0-glusterfsProd-disperse-2: dict=0x7f0b649520ac 
((trusted.glusterfs.dht:0:0:0:1:0:0:0:0:0:0:0:0:15:55:55:54:)(trusted.ec.version:0:0:0:0:0:0:0:b:0:0:0:0:0:0:0:e:))
[2017-01-12 20:48:23.418519] I [dict.c:3065:dict_dump_to_log] 
0-glusterfsProd-disperse-2: dict=0x7f0b6495b4e0 
((trusted.glusterfs.dht:0:0:0:1:0:0:0:0:0:0:0:0:15:55:55:54:)(trusted.ec.version:0:0:0:0:0:0:0:d:0:0:0:0:0:0:0:e:))
[2017-01-12 20:48:23.418531] W [MSGID: 122056] 
[ec-combine.c:873:ec_combine_check] 0-glusterfsProd-disperse-2: Mismatching 
xdata in answers of 'LOOKUP'


That's a real problem. Here we have two bricks that differ in the 
trusted.ec.version xattr. However, this xattr does not necessarily belong to 
the previous directory. They are unrelated messages.
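
To see how the two dict dumps above differ, the trusted.ec.version values can be decoded by hand. This is a sketch only: reading the xattr as two big-endian 64-bit counters is an assumption based on the byte layout shown in the log, not something stated in this thread.

```shell
# Hex value from the first dict dump above:
# trusted.ec.version:0:0:0:0:0:0:0:b:0:0:0:0:0:0:0:e
v="000000000000000b000000000000000e"

# Split into two 8-byte (16 hex digit) halves and print them as decimals.
# First counter: 0x0b = 11 here, versus 0x0d = 13 in the second dump;
# the second counter (0x0e = 14) matches in both, which is why the two
# dicts are reported as different.
echo "first counter:  $((16#${v:0:16}))"
echo "second counter: $((16#${v:16:16}))"
```

Running the same split against the second dump's value gives 13 and 14, matching the "'trusted.ec.version' is different in two dicts" message.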




 In this case glusterfsProd-disperse-2 sub volume actually consists 
of the following bricks.
 glusterfs4sds:/ws/disk11/ws_brick, glusterfs5sds: 
/ws/disk11/ws_brick, glusterfs6sds: /ws/disk11/ws_brick

 I went ahead and checked the value of trusted.ec.version on all 
the 3 bricks inside this sub vol:

 [root@glusterfs6 ~]# getfattr -e hex -n trusted.ec.version 
/ws/disk11/ws_brick//Folder_01.05.2017_21.15/CV_MAGNETIC/V_30970/CHUNK_390607
 # file: 
ws/disk11/ws_brick//Folder_01.05.2017_21.15/CV_MAGNETIC/V_30970/CHUNK_390607
 trusted.ec.version=0x0009000b

 [root@glusterfs4 ~]# getfattr -e hex -n trusted.ec.version 
/ws/disk11/ws_brick//Folder_01.05.2017_21.15/CV_MAGNETIC/V_30970/CHUNK_390607
 # file: 
ws/disk11/ws_brick//Folder_01.05.2017_21.15/CV_MAGNETIC/V_30970/CHUNK_390607
  trusted.ec.version=0x0009000b

  [root@glusterfs5 glusterfs]# getfattr -e hex -n 
trusted.ec.version 
/ws/disk11/ws_brick//Folder_01.05.2017_21.15/CV_MAGNETIC/V_30970/CHUNK_390607
  # file: