I'll send you the emails I sent Pranith with the logs. What causes these disconnects?
David (Sent from mobile)

===============================
David F. Robinson, Ph.D.
President - Corvid Technologies
704.799.6944 x101 [office]
704.252.1310 [cell]
704.799.7974 [fax]
david.robin...@corvidtec.com
http://www.corvidtechnologies.com

> On Feb 5, 2015, at 4:55 PM, Ben Turner <btur...@redhat.com> wrote:
>
> ----- Original Message -----
>> From: "Pranith Kumar Karampuri" <pkara...@redhat.com>
>> To: "Xavier Hernandez" <xhernan...@datalab.es>, "David F. Robinson"
>> <david.robin...@corvidtec.com>, "Benjamin Turner"
>> <bennytu...@gmail.com>
>> Cc: gluster-us...@gluster.org, "Gluster Devel" <gluster-devel@gluster.org>
>> Sent: Thursday, February 5, 2015 5:30:04 AM
>> Subject: Re: [Gluster-users] [Gluster-devel] missing files
>>
>>> On 02/05/2015 03:48 PM, Pranith Kumar Karampuri wrote:
>>> I believe David already fixed this. I hope this is the same
>>> permissions issue he told us about.
>> Oops, it is not. I will take a look.
>
> Yes David, exactly like these:
>
> data-brick02a-homegfs.log:[2015-02-03 19:09:34.568842] I [server.c:518:server_rpc_notify] 0-homegfs-server: disconnecting connection from gfs02a.corvidtec.com-18563-2015/02/03-19:07:58:519134-homegfs-client-2-0-0
> data-brick02a-homegfs.log:[2015-02-03 19:09:41.286551] I [server.c:518:server_rpc_notify] 0-homegfs-server: disconnecting connection from gfs01a.corvidtec.com-12804-2015/02/03-19:09:38:497808-homegfs-client-2-0-0
> data-brick02a-homegfs.log:[2015-02-03 19:16:35.906412] I [server.c:518:server_rpc_notify] 0-homegfs-server: disconnecting connection from gfs02b.corvidtec.com-27190-2015/02/03-19:15:53:458467-homegfs-client-2-0-0
> data-brick02a-homegfs.log:[2015-02-03 19:51:22.761293] I [server.c:518:server_rpc_notify] 0-homegfs-server: disconnecting connection from gfs01a.corvidtec.com-25926-2015/02/03-19:51:02:89070-homegfs-client-2-0-0
> data-brick02a-homegfs.log:[2015-02-03 20:54:02.772180] I [server.c:518:server_rpc_notify] 0-homegfs-server: disconnecting connection from gfs01b.corvidtec.com-4175-2015/02/02-16:44:31:179119-homegfs-client-2-0-1
>
> You can 100% verify my theory if you can correlate the time of the
> disconnects with the time that the missing files were healed. Can you have a
> look at /var/log/glusterfs/glustershd.log? That has all of the healed files
> plus timestamps. If we can see a disconnect during the rsync and a self-heal
> of the missing file, I think we can safely assume that the disconnects may
> have caused this. I'll try this on my test systems. How much data did you
> rsync? Roughly what size of files, and can you give an idea of the dir layout?
>
> @Pranith - Could bricks flapping up and down during the rsync be a possible
> cause here? The files would be missing on the first ls (written to 1 subvol
> but not the other because it was down), the ls triggered SH, and that's why
> the files were there for the second ls.
>
> -b
>
>> Pranith
>>>
>>> Pranith
>>>> On 02/05/2015 03:44 PM, Xavier Hernandez wrote:
>>>> Is the failure repeatable? With the same directories?
>>>>
>>>> It's very weird that the directories appear on the volume when you do
>>>> an 'ls' on the bricks. Could it be that you only made a single 'ls'
>>>> on the fuse mount, which did not show the directory? Is it possible that
>>>> this 'ls' triggered a self-heal that repaired the problem, whatever
>>>> it was, and when you did another 'ls' on the fuse mount after the
>>>> 'ls' on the bricks, the directories were there?
>>>>
>>>> The first 'ls' could have healed the files, so that the
>>>> following 'ls' on the bricks showed the files as if nothing were
>>>> damaged. If that's the case, it's possible that there were some
>>>> disconnections during the copy.
>>>>
>>>> Added Pranith because he knows the replication and self-heal details
>>>> better.
>>>>
>>>> Xavi
>>>>
>>>>> On 02/04/2015 07:23 PM, David F. Robinson wrote:
>>>>> Distributed/replicated
>>>>>
>>>>> Volume Name: homegfs
>>>>> Type: Distributed-Replicate
>>>>> Volume ID: 1e32672a-f1b7-4b58-ba94-58c085e59071
>>>>> Status: Started
>>>>> Number of Bricks: 4 x 2 = 8
>>>>> Transport-type: tcp
>>>>> Bricks:
>>>>> Brick1: gfsib01a.corvidtec.com:/data/brick01a/homegfs
>>>>> Brick2: gfsib01b.corvidtec.com:/data/brick01b/homegfs
>>>>> Brick3: gfsib01a.corvidtec.com:/data/brick02a/homegfs
>>>>> Brick4: gfsib01b.corvidtec.com:/data/brick02b/homegfs
>>>>> Brick5: gfsib02a.corvidtec.com:/data/brick01a/homegfs
>>>>> Brick6: gfsib02b.corvidtec.com:/data/brick01b/homegfs
>>>>> Brick7: gfsib02a.corvidtec.com:/data/brick02a/homegfs
>>>>> Brick8: gfsib02b.corvidtec.com:/data/brick02b/homegfs
>>>>> Options Reconfigured:
>>>>> performance.io-thread-count: 32
>>>>> performance.cache-size: 128MB
>>>>> performance.write-behind-window-size: 128MB
>>>>> server.allow-insecure: on
>>>>> network.ping-timeout: 10
>>>>> storage.owner-gid: 100
>>>>> geo-replication.indexing: off
>>>>> geo-replication.ignore-pid-check: on
>>>>> changelog.changelog: on
>>>>> changelog.fsync-interval: 3
>>>>> changelog.rollover-time: 15
>>>>> server.manage-gids: on
>>>>>
>>>>>
>>>>> ------ Original Message ------
>>>>> From: "Xavier Hernandez" <xhernan...@datalab.es>
>>>>> To: "David F. Robinson" <david.robin...@corvidtec.com>; "Benjamin
>>>>> Turner" <bennytu...@gmail.com>
>>>>> Cc: "gluster-us...@gluster.org" <gluster-us...@gluster.org>; "Gluster
>>>>> Devel" <gluster-devel@gluster.org>
>>>>> Sent: 2/4/2015 6:03:45 AM
>>>>> Subject: Re: [Gluster-devel] missing files
>>>>>
>>>>>>> On 02/04/2015 01:30 AM, David F. Robinson wrote:
>>>>>>> Sorry. Thought about this a little more. I should have been clearer.
>>>>>>> The files were on both bricks of the replica, not just one side. So,
>>>>>>> both bricks had to have been up... The files/directories just don't
>>>>>>> show up on the mount.
>>>>>>> I was reading and saw a related bug
>>>>>>> (https://bugzilla.redhat.com/show_bug.cgi?id=1159484). I saw it
>>>>>>> suggested to run:
>>>>>>> find <mount> -d -exec getfattr -h -n trusted.ec.heal {} \;
>>>>>>
>>>>>> This command is specific to a dispersed volume. It won't do anything
>>>>>> (aside from the error you are seeing) on a replicated volume.
>>>>>>
>>>>>> I think you are using a replicated volume, right?
>>>>>>
>>>>>> In this case I'm not sure what can be happening. Is your volume a pure
>>>>>> replicated one or a distributed-replicated one? On a pure replicated
>>>>>> volume it doesn't make sense that some entries do not show in an 'ls'
>>>>>> when the file is in both replicas (at least without any error message
>>>>>> in the logs). On a distributed-replicated volume it could be caused by
>>>>>> some problem while combining the contents of each replica set.
>>>>>>
>>>>>> What's the configuration of your volume?
>>>>>>
>>>>>> Xavi
>>>>>>
>>>>>>>
>>>>>>> I get a bunch of errors for operation not supported:
>>>>>>> [root@gfs02a homegfs]# find wks_backup -d -exec getfattr -h -n trusted.ec.heal {} \;
>>>>>>> find: warning: the -d option is deprecated; please use -depth instead, because the latter is a POSIX-compliant feature.
>>>>>>> wks_backup/homer_backup/backup: trusted.ec.heal: Operation not supported
>>>>>>> wks_backup/homer_backup/logs/2014_05_20.log: trusted.ec.heal: Operation not supported
>>>>>>> wks_backup/homer_backup/logs/2014_05_21.log: trusted.ec.heal: Operation not supported
>>>>>>> wks_backup/homer_backup/logs/2014_05_18.log: trusted.ec.heal: Operation not supported
>>>>>>> wks_backup/homer_backup/logs/2014_05_19.log: trusted.ec.heal: Operation not supported
>>>>>>> wks_backup/homer_backup/logs/2014_05_22.log: trusted.ec.heal: Operation not supported
>>>>>>> wks_backup/homer_backup/logs: trusted.ec.heal: Operation not supported
>>>>>>> wks_backup/homer_backup: trusted.ec.heal: Operation not supported
>>>>>>> ------ Original Message ------
>>>>>>> From: "Benjamin Turner" <bennytu...@gmail.com>
>>>>>>> To: "David F. Robinson" <david.robin...@corvidtec.com>
>>>>>>> Cc: "Gluster Devel" <gluster-devel@gluster.org>;
>>>>>>> "gluster-us...@gluster.org" <gluster-us...@gluster.org>
>>>>>>> Sent: 2/3/2015 7:12:34 PM
>>>>>>> Subject: Re: [Gluster-devel] missing files
>>>>>>>> It sounds to me like the files were only copied to one replica,
>>>>>>>> weren't there for the initial ls, which triggered a self-heal,
>>>>>>>> and were there for the last ls because they were healed. Is there
>>>>>>>> any chance that one of the replicas was down during the rsync? It
>>>>>>>> could be that you lost a brick during the copy or something like
>>>>>>>> that. To confirm, I would look for disconnects in the brick logs as
>>>>>>>> well as checking glustershd.log to verify the missing files were
>>>>>>>> actually healed.
>>>>>>>>
>>>>>>>> -b
>>>>>>>>
>>>>>>>> On Tue, Feb 3, 2015 at 5:37 PM, David F. Robinson
>>>>>>>> <david.robin...@corvidtec.com> wrote:
>>>>>>>>
>>>>>>>> I rsync'd 20 TB over to my gluster system and noticed that I had
>>>>>>>> some directories missing even though the rsync completed normally.
>>>>>>>> The rsync logs showed that the missing files were transferred.
>>>>>>>> I went to the bricks and did an 'ls -al /data/brick*/homegfs/dir/*';
>>>>>>>> the files were on the bricks. After I did this 'ls', the files then
>>>>>>>> showed up on the FUSE mounts.
>>>>>>>> 1) Why are the files hidden on the FUSE mount?
>>>>>>>> 2) Why does the ls make them show up on the FUSE mount?
>>>>>>>> 3) How can I prevent this from happening again?
>>>>>>>> Note, I also mounted the gluster volume using NFS and saw the same
>>>>>>>> behavior. The files/directories were not shown until I did the
>>>>>>>> "ls" on the bricks.
>>>>>>>> David
>>>>>>>> ===============================
>>>>>>>> David F. Robinson, Ph.D.
>>>>>>>> President - Corvid Technologies
>>>>>>>> 704.799.6944 x101 [office]
>>>>>>>> 704.252.1310 [cell]
>>>>>>>> 704.799.7974 [fax]
>>>>>>>> david.robin...@corvidtec.com
>>>>>>>> http://www.corvidtechnologies.com
>>>>>>>>
>>>>>>>> _______________________________________________
>>>>>>>> Gluster-devel mailing list
>>>>>>>> Gluster-devel@gluster.org
>>>>>>>> http://www.gluster.org/mailman/listinfo/gluster-devel
>>>
>>> _______________________________________________
>>> Gluster-users mailing list
>>> gluster-us...@gluster.org
>>> http://www.gluster.org/mailman/listinfo/gluster-users
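
For anyone following the thread, here is a minimal sketch of the correlation Ben suggests: pull the disconnect timestamps out of a brick log and check whether any fall close to the time a missing file was healed. The regular expression is based only on the brick-log lines quoted above. The format of glustershd.log is not shown in this thread, so the heal time is passed in by hand rather than parsed. The function names and the 5-minute window are hypothetical choices for illustration, not part of any Gluster tool.

```python
import re
from datetime import datetime, timedelta

# Pattern derived from the disconnect lines quoted in this thread, e.g.
# [2015-02-03 19:09:34.568842] I [server.c:518:server_rpc_notify] ... disconnecting connection from <client>
DISCONNECT_RE = re.compile(
    r"\[(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+)\]"
    r".*server_rpc_notify.*disconnecting connection from\s+(?P<client>\S+)"
)


def parse_disconnects(lines):
    """Return (timestamp, client) pairs for every disconnect line found."""
    events = []
    for line in lines:
        match = DISCONNECT_RE.search(line)
        if match:
            ts = datetime.strptime(match.group("ts"), "%Y-%m-%d %H:%M:%S.%f")
            events.append((ts, match.group("client")))
    return events


def disconnects_near(events, when, window=timedelta(minutes=5)):
    """Return the disconnects within `window` of `when` (assumed heal time)."""
    return [(ts, client) for ts, client in events if abs(ts - when) <= window]


if __name__ == "__main__":
    # The brick log path comes from the volume info above; adjust as needed.
    with open("/var/log/glusterfs/bricks/data-brick02a-homegfs.log") as fh:
        events = parse_disconnects(fh)
    # Heal time read manually from glustershd.log (placeholder value).
    heal_time = datetime(2015, 2, 3, 19, 10, 0)
    for ts, client in disconnects_near(events, heal_time):
        print(ts, client)
```

If a disconnect shows up within the window around a heal of one of the missing files, that supports the theory that a brick dropped out during the rsync; no match does not rule it out, since the window is arbitrary.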