Re: [Gluster-devel] [Gluster-users] missing files
On Sun, Feb 08, 2015 at 01:43:55PM, Justin Clift wrote:
> On 6 Feb 2015, at 20:33, Ben Turner wrote:
> > > Which multi-threaded epoll code just landed in master? Are you thinking
> > > of this one?
> > >
> > > http://review.gluster.org/#/c/3842/
> > >
> > > If so, it's not in master yet. ;)
> >
> > Doh! I just saw "Required patches are all upstream now" and assumed they
> > were merged. I have been in class all week so I am not up to date with
> > everything. I gave instructions on compiling it from the gerrit patches +
> > master, so if David wants to give it a go he can. Sorry for the confusion.
>
> Vijay merged the code into master yesterday, so it shouldn't be too long until
> we can get some RPMs created for people to test with (easily). :)

Nightly builds that already include this change are available here:
http://download.gluster.org/pub/gluster/glusterfs/nightly/glusterfs/

Niels

___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel
Re: [Gluster-devel] [Gluster-users] missing files
- Original Message -
> From: "Ben Turner"
> To: "David F. Robinson"
> Cc: "Pranith Kumar Karampuri", "Xavier Hernandez", "Benjamin Turner", gluster-us...@gluster.org, "Gluster Devel"
> Sent: Thursday, February 5, 2015 5:22:26 PM
> Subject: Re: [Gluster-users] [Gluster-devel] missing files
>
> > I'll send you the emails I sent Pranith with the logs. What causes these
> > disconnects?
>
> Thanks David! Disconnects happen when there are interruptions in
> communication between peers; normally a ping timeout occurs. It could be
> anything from a flaky NW to the system being too busy to respond to the
> pings. My initial take is more towards the latter, as rsync is absolutely
> the worst use case for gluster - IIRC it writes in 4KB blocks. I try to
> keep my writes at least 64KB, as in my testing that is the smallest block
> size I can write with before perf starts to really drop off. I'll try
> something similar in the lab.

OK, I do think that the files being self-healed is the RCA for what you were seeing. Let's look at one of the disconnects:

data-brick02a-homegfs.log:[2015-02-03 20:54:02.772180] I [server.c:518:server_rpc_notify] 0-homegfs-server: disconnecting connection from gfs01b.corvidtec.com-4175-2015/02/02-16:44:31:179119-homegfs-client-2-0-1

And in the glustershd.log from the gfs01b_glustershd.log file:

[2015-02-03 20:55:48.001797] I [afr-self-heal-entry.c:554:afr_selfheal_entry_do] 0-homegfs-replicate-0: performing entry selfheal on 6c79a368-edaa-432b-bef9-ec690ab42448
[2015-02-03 20:55:49.341996] I [afr-self-heal-common.c:476:afr_log_selfheal] 0-homegfs-replicate-0: Completed entry selfheal on 6c79a368-edaa-432b-bef9-ec690ab42448. source=1 sinks=0
[2015-02-03 20:55:49.343093] I [afr-self-heal-entry.c:554:afr_selfheal_entry_do] 0-homegfs-replicate-0: performing entry selfheal on 792cb0d6-9290-4447-8cd7-2b2d7a116a69
[2015-02-03 20:55:50.463652] I [afr-self-heal-common.c:476:afr_log_selfheal] 0-homegfs-replicate-0: Completed entry selfheal on 792cb0d6-9290-4447-8cd7-2b2d7a116a69. source=1 sinks=0
[2015-02-03 20:55:51.465289] I [afr-self-heal-metadata.c:54:__afr_selfheal_metadata_do] 0-homegfs-replicate-0: performing metadata selfheal on 403e661a-1c27-4e79-9867-c0572aba2b3c
[2015-02-03 20:55:51.466515] I [afr-self-heal-common.c:476:afr_log_selfheal] 0-homegfs-replicate-0: Completed metadata selfheal on 403e661a-1c27-4e79-9867-c0572aba2b3c. source=1 sinks=0
[2015-02-03 20:55:51.467098] I [afr-self-heal-entry.c:554:afr_selfheal_entry_do] 0-homegfs-replicate-0: performing entry selfheal on 403e661a-1c27-4e79-9867-c0572aba2b3c
[2015-02-03 20:55:55.257808] I [afr-self-heal-common.c:476:afr_log_selfheal] 0-homegfs-replicate-0: Completed entry selfheal on 403e661a-1c27-4e79-9867-c0572aba2b3c. source=1 sinks=0
[2015-02-03 20:55:55.258548] I [afr-self-heal-metadata.c:54:__afr_selfheal_metadata_do] 0-homegfs-replicate-0: performing metadata selfheal on c612ee2f-2fb4-4157-a9ab-5a2d5603c541
[2015-02-03 20:55:55.259367] I [afr-self-heal-common.c:476:afr_log_selfheal] 0-homegfs-replicate-0: Completed metadata selfheal on c612ee2f-2fb4-4157-a9ab-5a2d5603c541. source=1 sinks=0
[2015-02-03 20:55:55.259980] I [afr-self-heal-entry.c:554:afr_selfheal_entry_do] 0-homegfs-replicate-0: performing entry selfheal on c612ee2f-2fb4-4157-a9ab-5a2d5603c541

As you can see, the self-heal logs are spammed with files being healed, and for the couple of disconnects I looked at, I see self-heals getting run shortly afterwards on the bricks that were down. Now we need to find the cause of the disconnects; I am thinking that once the disconnects are resolved, the files should be properly copied over without self-heal having to fix things.

Like I said, I'll give this a go on my lab systems and see if I can repro the disconnects; I'll have time to run through it tomorrow. If in the meantime anyone else has a theory / anything to add here, it would be appreciated.

-b
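Ben's suggestion of correlating disconnect times with self-heal times can be sketched mechanically. The following is a hypothetical helper, not part of gluster: the markers and timestamp format are taken from the log lines quoted in this thread, and the 5-minute matching window is an assumption to tune for your environment.

```python
import re
from datetime import datetime, timedelta

# Timestamp prefix used by the glusterfs logs quoted above,
# e.g. "[2015-02-03 20:54:02.772180]"
TS = re.compile(r"\[(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\.\d+\]")

def parse_times(lines, marker):
    """Return timestamps of log lines that contain `marker`."""
    times = []
    for line in lines:
        if marker in line:
            m = TS.search(line)
            if m:
                times.append(datetime.strptime(m.group(1), "%Y-%m-%d %H:%M:%S"))
    return times

def heals_following_disconnects(brick_log_lines, shd_log_lines, window_min=5):
    """Map each brick-log disconnect to the entry self-heals completed
    within `window_min` minutes afterwards (assumed window)."""
    disconnects = parse_times(brick_log_lines, "disconnecting connection")
    heals = parse_times(shd_log_lines, "Completed entry selfheal")
    w = timedelta(minutes=window_min)
    return {d.isoformat(): [h.isoformat() for h in heals if d <= h <= d + w]
            for d in disconnects}
```

Feeding it the brick-log line and the glustershd.log excerpt quoted above pairs the 20:54:02 disconnect with the 20:55:xx entry self-heals, which is exactly the correlation Ben describes.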
Re: [Gluster-devel] [Gluster-users] missing files
- Original Message -
> From: "Justin Clift"
> To: "Benjamin Turner"
> Cc: "David F. Robinson", gluster-us...@gluster.org, "Gluster Devel", "Ben Turner"
> Sent: Friday, February 6, 2015 3:27:53 PM
> Subject: Re: [Gluster-devel] [Gluster-users] missing files
>
> On 6 Feb 2015, at 02:05, Benjamin Turner wrote:
> > I think that the multi threaded epoll changes that _just_ landed in master
> > will help resolve this, but they are so new I haven't been able to test
> > this. I'll know more when I get a chance to test tomorrow.
>
> Which multi-threaded epoll code just landed in master? Are you thinking
> of this one?
>
> http://review.gluster.org/#/c/3842/
>
> If so, it's not in master yet. ;)

Doh! I just saw "Required patches are all upstream now" and assumed they were merged. I have been in class all week so I am not up to date with everything. I gave instructions on compiling it from the gerrit patches + master, so if David wants to give it a go he can. Sorry for the confusion.

-b

> + Justin
Re: [Gluster-devel] [Gluster-users] missing files
- Original Message -
> From: "Pranith Kumar Karampuri"
> To: "Xavier Hernandez", "David F. Robinson", "Benjamin Turner"
> Cc: gluster-us...@gluster.org, "Gluster Devel"
> Sent: Thursday, February 5, 2015 5:30:04 AM
> Subject: Re: [Gluster-users] [Gluster-devel] missing files
>
> On 02/05/2015 03:48 PM, Pranith Kumar Karampuri wrote:
> > I believe David already fixed this. I hope this is the same issue he
> > told us about, the permissions issue.
> Oops, it is not. I will take a look.

Yes David, exactly like these:

data-brick02a-homegfs.log:[2015-02-03 19:09:34.568842] I [server.c:518:server_rpc_notify] 0-homegfs-server: disconnecting connection from gfs02a.corvidtec.com-18563-2015/02/03-19:07:58:519134-homegfs-client-2-0-0
data-brick02a-homegfs.log:[2015-02-03 19:09:41.286551] I [server.c:518:server_rpc_notify] 0-homegfs-server: disconnecting connection from gfs01a.corvidtec.com-12804-2015/02/03-19:09:38:497808-homegfs-client-2-0-0
data-brick02a-homegfs.log:[2015-02-03 19:16:35.906412] I [server.c:518:server_rpc_notify] 0-homegfs-server: disconnecting connection from gfs02b.corvidtec.com-27190-2015/02/03-19:15:53:458467-homegfs-client-2-0-0
data-brick02a-homegfs.log:[2015-02-03 19:51:22.761293] I [server.c:518:server_rpc_notify] 0-homegfs-server: disconnecting connection from gfs01a.corvidtec.com-25926-2015/02/03-19:51:02:89070-homegfs-client-2-0-0
data-brick02a-homegfs.log:[2015-02-03 20:54:02.772180] I [server.c:518:server_rpc_notify] 0-homegfs-server: disconnecting connection from gfs01b.corvidtec.com-4175-2015/02/02-16:44:31:179119-homegfs-client-2-0-1

You can 100% verify my theory if you can correlate the time of the disconnects with the time that the missing files were healed. Can you have a look at /var/log/glusterfs/glustershd.log? That has all of the healed files + timestamps; if we can see a disconnect during the rsync and a self-heal of the missing file, I think we can safely assume that the disconnects caused this.

I'll try this on my test systems. How much data did you rsync? What size-ish of files / an idea of the dir layout?

@Pranith - Could bricks flapping up and down during the rsync be a possible cause here: the files were missing on the first ls (written to 1 subvol but not the other because it was down), the ls triggered SH, and that's why the files were there for the second ls?

-b

> Pranith
> >
> > Pranith
> > On 02/05/2015 03:44 PM, Xavier Hernandez wrote:
> >> Is the failure repeatable? With the same directories?
> >>
> >> It's very weird that the directories appear on the volume when you do
> >> an 'ls' on the bricks. Could it be that you only made a single 'ls'
> >> on the fuse mount which did not show the directory? Is it possible that
> >> this 'ls' triggered a self-heal that repaired the problem, whatever
> >> it was, and when you did another 'ls' on the fuse mount after the
> >> 'ls' on the bricks, the directories were there?
> >>
> >> The first 'ls' could have healed the files, causing the following
> >> 'ls' on the bricks to show the files as if nothing were damaged. If
> >> that's the case, it's possible that there were some disconnections
> >> during the copy.
> >>
> >> Added Pranith because he knows replication and self-heal details better.
> >>
> >> Xavi
> >>
> >> On 02/04/2015 07:23 PM, David F. Robinson wrote:
> >>> Distributed/replicated
> >>>
> >>> Volume Name: homegfs
> >>> Type: Distributed-Replicate
> >>> Volume ID: 1e32672a-f1b7-4b58-ba94-58c085e59071
> >>> Status: Started
> >>> Number of Bricks: 4 x 2 = 8
> >>> Transport-type: tcp
> >>> Bricks:
> >>> Brick1: gfsib01a.corvidtec.com:/data/brick01a/homegfs
> >>> Brick2: gfsib01b.corvidtec.com:/data/brick01b/homegfs
> >>> Brick3: gfsib01a.corvidtec.com:/data/brick02a/homegfs
> >>> Brick4: gfsib01b.corvidtec.com:/data/brick02b/homegfs
> >>> Brick5: gfsib02a.corvidtec.com:/data/brick01a/homegfs
> >>> Brick6: gfsib02b.corvidtec.com:/data/brick01b/homegfs
> >>> Brick7: gfsib02a.corvidtec.com:/data/brick02a/homegfs
> >>> Brick8: gfsib02b.corvidtec.com:/data/brick02b/homegfs
> >>> Options Reconfigured:
> >>> performance.io-thread-count: 32
> >>> performance.cache-size: 128MB
> >>> performance.write-behind-window-size: 128MB
> >>> server.allow-insecure: on
> >>> network.ping-timeout: 10
> >>> storage.owner-gid: 100
> >>> geo-replication.indexing: off
> >>> geo-replication.ignore-pid-check: on
> >>> changelog.changelog: on
> >>> changelog.fsync-interval: 3
> >>> changelog.rollover-time: 15
> >>> server.manage-gids: on
> >>>
> >>> -- Original Message --
> >>> From: "Xavier Hernandez"
> >>> Sent: 2/4/2015 6:03:45 AM
> >>> Subject: Re: [Gluster-devel] missing files
> >>>
> >>>> On 02/04/2015 01:30 AM, David F. Robinson wrote:
> >>>> > Sorry. Thought about this a little more. I should have been clearer.
> >>>> > The files were on both bricks o
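Ben's "bricks flapping" question can be checked mechanically by counting how often each client shows up in the brick logs' "disconnecting connection from" lines. This is a hypothetical sketch; the hostname pattern is an assumption inferred from the client IDs quoted in this thread (hostname, then "-<pid>-<date>...").

```python
import re
from collections import Counter

# Client IDs in the brick logs look like
# "gfs01a.corvidtec.com-12804-2015/02/03-...-homegfs-client-2-0-0";
# capture the hostname portion before the "-<pid>-<year>/" part (assumed format).
DISCONNECT = re.compile(
    r"disconnecting connection from (?P<host>[\w.-]+)-\d+-\d{4}/")

def disconnects_per_client(log_lines):
    """Count brick-log disconnect events per client hostname.
    Repeated disconnects from the same host during one rsync suggest flapping."""
    counts = Counter()
    for line in log_lines:
        m = DISCONNECT.search(line)
        if m:
            counts[m.group("host")] += 1
    return counts
```

Run over the five lines quoted above, this shows gfs01a.corvidtec.com disconnecting twice within the same window, the kind of repeat that would point at a flapping connection.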
Re: [Gluster-devel] [Gluster-users] missing files
- Original Message -
> From: "David F. Robinson"
> To: "Ben Turner"
> Cc: "Pranith Kumar Karampuri", "Xavier Hernandez", "Benjamin Turner", gluster-us...@gluster.org, "Gluster Devel"
> Sent: Thursday, February 5, 2015 5:01:13 PM
> Subject: Re: [Gluster-users] [Gluster-devel] missing files
>
> I'll send you the emails I sent Pranith with the logs. What causes these
> disconnects?

Thanks David! Disconnects happen when there are interruptions in communication between peers; normally a ping timeout occurs. It could be anything from a flaky NW to the system being too busy to respond to the pings. My initial take is more towards the latter, as rsync is absolutely the worst use case for gluster - IIRC it writes in 4KB blocks. I try to keep my writes at least 64KB, as in my testing that is the smallest block size I can write with before perf starts to really drop off. I'll try something similar in the lab.

-b

> David (Sent from mobile)
>
> ===
> David F. Robinson, Ph.D.
> President - Corvid Technologies
> 704.799.6944 x101 [office]
> 704.252.1310 [cell]
> 704.799.7974 [fax]
> david.robin...@corvidtec.com
> http://www.corvidtechnologies.com
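Ben's 4KB-vs-64KB point can be sanity-checked with a quick write micro-benchmark. This is an illustrative sketch only, with made-up sizes, writing to a local temp file: a local disk may show little difference, but when the temp directory sits on a gluster FUSE mount (where small writes pay per-request overhead) the drop-off below 64KB blocks is the effect Ben describes.

```python
import os
import tempfile
import time

def write_throughput_mb_s(block_size, total_bytes=8 * 1024 * 1024):
    """Write `total_bytes` in `block_size` chunks and return MB/s.
    Set tempfile.tempdir to a gluster mount to measure the real effect."""
    buf = b"\0" * block_size
    fd, path = tempfile.mkstemp()
    try:
        start = time.perf_counter()
        with os.fdopen(fd, "wb") as f:
            written = 0
            while written < total_bytes:
                f.write(buf)
                written += block_size
            f.flush()
            os.fsync(f.fileno())  # include the flush-to-disk cost
        elapsed = time.perf_counter() - start
    finally:
        os.unlink(path)
    return total_bytes / elapsed / (1024 * 1024)

if __name__ == "__main__":
    for bs in (4096, 65536):  # rsync-like 4KB vs Ben's recommended 64KB
        print(f"{bs // 1024}KB blocks: {write_throughput_mb_s(bs):.0f} MB/s")
```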
Re: [Gluster-devel] [Gluster-users] missing files
On 6 Feb 2015, at 20:33, Ben Turner wrote:
> > Which multi-threaded epoll code just landed in master? Are you thinking
> > of this one?
> >
> > http://review.gluster.org/#/c/3842/
> >
> > If so, it's not in master yet. ;)
>
> Doh! I just saw "Required patches are all upstream now" and assumed they
> were merged. I have been in class all week so I am not up to date with
> everything. I gave instructions on compiling it from the gerrit patches +
> master, so if David wants to give it a go he can. Sorry for the confusion.

Vijay merged the code into master yesterday, so it shouldn't be too long until we can get some RPMs created for people to test with (easily). :)

+ Justin

--
GlusterFS - http://www.gluster.org
An open source, distributed file system scaling to several petabytes, and handling thousands of clients.

My personal twitter: twitter.com/realjustinclift
Re: [Gluster-devel] [Gluster-users] missing files
I don't think I understood what you sent enough to give it a try. I'll wait until it comes out in a beta or release version.

David

-- Original Message --
From: "Ben Turner"
To: "Justin Clift"; "David F. Robinson"
Sent: 2/6/2015 3:33:42 PM
Subject: Re: [Gluster-devel] [Gluster-users] missing files

Doh! I just saw "Required patches are all upstream now" and assumed they were merged. I have been in class all week so I am not up to date with everything. I gave instructions on compiling it from the gerrit patches + master so if David wants to give it a go he can. Sorry for the confusion.

-b
Re: [Gluster-devel] [Gluster-users] missing files
On 6 Feb 2015, at 02:05, Benjamin Turner wrote:
> I think that the multi threaded epoll changes that _just_ landed in master
> will help resolve this, but they are so new I haven't been able to test this.
> I'll know more when I get a chance to test tomorrow.

Which multi-threaded epoll code just landed in master? Are you thinking of this one?

http://review.gluster.org/#/c/3842/

If so, it's not in master yet. ;)

+ Justin

> -b
>
> On Thu, Feb 5, 2015 at 6:04 PM, David F. Robinson wrote:
> Isn't rsync what geo-rep uses?
>
> David (Sent from mobile)
Re: [Gluster-devel] [Gluster-users] missing files
copy that. Thanks for looking into the issue.

David

-- Original Message --
From: "Benjamin Turner"
To: "David F. Robinson"
Cc: "Ben Turner" ; "Pranith Kumar Karampuri" ; "Xavier Hernandez" ; "gluster-us...@gluster.org" ; "Gluster Devel"
Sent: 2/5/2015 9:05:43 PM
Subject: Re: [Gluster-users] [Gluster-devel] missing files
Re: [Gluster-devel] [Gluster-users] missing files
Correct! I have seen (back in the day; it's been 3-ish years since I have seen it) having, say, 50+ volumes each with a geo-rep session take system load levels to the point where pings couldn't be serviced within the ping timeout. So it is known to happen, but there has been a lot of work in the geo-rep space to help here, some of which is discussed at https://medium.com/@msvbhat/distributed-geo-replication-in-glusterfs-ec95f4393c50 (think tar + ssh and other fixes). Your symptoms remind me of that case of 50+ geo-rep'd volumes; that's why I mentioned it from the start.

My current shoot-from-the-hip theory is that when rsyncing all that data the servers got too busy to service the pings, and that led to disconnects. This is common across all of the clustering / distributed software I have worked on: if the system gets too busy to service heartbeat within the timeout, things go crazy (think fork bomb on a single host). Now this could be a case of me putting symptoms from an old issue into what you are describing, but that's where my head is at. If I'm correct I should be able to repro using a similar workload.

I think that the multi-threaded epoll changes that _just_ landed in master will help resolve this, but they are so new I haven't been able to test this. I'll know more when I get a chance to test tomorrow.

-b

On Thu, Feb 5, 2015 at 6:04 PM, David F. Robinson <david.robin...@corvidtec.com> wrote:
> Isn't rsync what geo-rep uses?
>
> David (Sent from mobile)
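For reference, the ping timeout discussed above is a per-volume option; the volume info quoted elsewhere in this thread shows it set to 10 seconds on homegfs. A hedged sketch of checking it and temporarily raising it while a heavy rsync runs (the volume name comes from this thread; the 30-second value is purely illustrative, not a recommendation made here):

```shell
# Show the currently configured ping timeout on the volume from this thread
# (the thread's "gluster volume info" output lists network.ping-timeout: 10)
gluster volume info homegfs | grep network.ping-timeout

# Illustrative only: raise the timeout if heavy load (e.g. a large rsync)
# is starving heartbeat responses; revert once the copy completes
gluster volume set homegfs network.ping-timeout 30
```

These commands need a live gluster cluster, so treat them as a configuration fragment rather than something to run blindly.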
Re: [Gluster-devel] [Gluster-users] missing files
- Original Message -
> From: "Ben Turner"
> To: "Pranith Kumar Karampuri" , "David F. Robinson"
> Cc: "Xavier Hernandez" , "Benjamin Turner" , gluster-us...@gluster.org, "Gluster Devel"
> Sent: Friday, February 6, 2015 3:25:28 AM
> Subject: Re: [Gluster-users] [Gluster-devel] missing files
>
> - Original Message -
> > From: "Pranith Kumar Karampuri"
> > To: "Xavier Hernandez" , "David F. Robinson" , "Benjamin Turner"
> > Cc: gluster-us...@gluster.org, "Gluster Devel"
> > Sent: Thursday, February 5, 2015 5:30:04 AM
> > Subject: Re: [Gluster-users] [Gluster-devel] missing files
> >
> > On 02/05/2015 03:48 PM, Pranith Kumar Karampuri wrote:
> > > I believe David already fixed this. I hope this is the same issue he
> > > told about permissions issue.
> > Oops, it is not. I will take a look.
>
> Yes David, exactly like these:
>
> data-brick02a-homegfs.log:[2015-02-03 19:09:34.568842] I [server.c:518:server_rpc_notify] 0-homegfs-server: disconnecting connection from gfs02a.corvidtec.com-18563-2015/02/03-19:07:58:519134-homegfs-client-2-0-0
> data-brick02a-homegfs.log:[2015-02-03 19:09:41.286551] I [server.c:518:server_rpc_notify] 0-homegfs-server: disconnecting connection from gfs01a.corvidtec.com-12804-2015/02/03-19:09:38:497808-homegfs-client-2-0-0
> data-brick02a-homegfs.log:[2015-02-03 19:16:35.906412] I [server.c:518:server_rpc_notify] 0-homegfs-server: disconnecting connection from gfs02b.corvidtec.com-27190-2015/02/03-19:15:53:458467-homegfs-client-2-0-0
> data-brick02a-homegfs.log:[2015-02-03 19:51:22.761293] I [server.c:518:server_rpc_notify] 0-homegfs-server: disconnecting connection from gfs01a.corvidtec.com-25926-2015/02/03-19:51:02:89070-homegfs-client-2-0-0
> data-brick02a-homegfs.log:[2015-02-03 20:54:02.772180] I [server.c:518:server_rpc_notify] 0-homegfs-server: disconnecting connection from gfs01b.corvidtec.com-4175-2015/02/02-16:44:31:179119-homegfs-client-2-0-1
>
> You can 100% verify my theory if you can correlate the time on the disconnects to the time that the missing files were healed. Can you have a look at /var/log/glusterfs/glustershd.log? That has all of the healed files + timestamps; if we can see a disconnect during the rsync and a self heal of the missing file, I think we can safely assume that the disconnects may have caused this. I'll try this on my test systems. How much data did you rsync? What size-ish of files / an idea of the dir layout?
>
> @Pranith - Could bricks flapping up and down during the rsync cause the files to be missing on the first ls (written to 1 subvol but not the other because it was down), the ls triggered SH, and that's why the files were there for the second ls, be a possible cause here?

No, it would be a bug. AFR should serve the directory contents from the brick with those files.

>
> -b
>
> > Pranith
> > >
> > > Pranith
> > > On 02/05/2015 03:44 PM, Xavier Hernandez wrote:
> > > > Is the failure repeatable? With the same directories?
> > > >
> > > > It's very weird that the directories appear on the volume when you do an 'ls' on the bricks. Could it be that you only made a single 'ls' on the fuse mount which did not show the directory? Is it possible that this 'ls' triggered a self-heal that repaired the problem, whatever it was, and when you did another 'ls' on the fuse mount after the 'ls' on the bricks, the directories were there?
> > > >
> > > > The first 'ls' could have healed the files, causing the following 'ls' on the bricks to show the files as if nothing were damaged. If that's the case, it's possible that there were some disconnections during the copy.
> > > >
> > > > Added Pranith because he knows replication and self-heal details better.
> > > >
> > > > Xavi
> > > >
> > > > On 02/04/2015 07:23 PM, David F. Robinson wrote:
> > > > > Distributed/replicated
> > > > >
> > > > > Volume Name: homegfs
> > > > > Type: Distributed-Replicate
> > > > > Volume ID: 1e32672a-f1b7-4b58-ba94-58c085e59071
> > > > > Status: Started
> > > > > Number of Bricks: 4 x 2 = 8
> > > > > Transport-type: tcp
> > > > > Bricks:
> > > > > Brick1: gfsib01a.corvidtec.com:/data/brick01a/homegfs
> > > > > Brick2: gfsib01b.corvidtec.com:/data/brick01b/homegfs
> > > > > Brick3: gfsib01a.corvidtec.com:/data/brick02a/homegfs
> > > > > Brick4: gfsib01b.corvidtec.com:/data/brick02b/homegfs
> > > > > Brick5: gfsib02a.corvidtec.com:/data/brick01a/homegfs
> > > > > Brick6: gfsib02b.corvidtec.com:/data/brick01b/homegfs
> > > > > Brick7: gfsib02a.corvidtec.com:/data/brick02a/homegfs
> > > > > Brick8: gfsib02b.corvidtec.com:/data/brick02b/homegfs
> > > > > Options Reconfigured:
> > > > > performance.io-thread-count: 32
> > > > > performance.cache-size: 128MB
> > > > > performance.write-behind-window-size: 128MB
> > > > > server.allow-insecure: on
> > > > > network.ping-timeout: 10
> > > > > storage.owner-gid: 100
> > > > > geo-replication.indexing: off
> > > > > geo-replication.ignore-pid-check: on
> > > > > changelog.changelog: on
> > > > > changelog.fsync-interval: 3
> > > > > changelog.r
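Ben's correlation check above can be sketched as a small shell pipeline: tag disconnect lines from the brick logs and heal lines from glustershd.log, then sort everything by the bracketed timestamp so heals that closely follow a disconnect stand out. The two sample lines below are copied from the logs quoted in this thread; in practice you would point the greps at /var/log/glusterfs/bricks/*.log and /var/log/glusterfs/glustershd.log on each server (paths are an assumption about the default log locations).

```shell
# Sample events taken verbatim from this thread's logs
cat > /tmp/brick.log <<'EOF'
[2015-02-03 20:54:02.772180] I [server.c:518:server_rpc_notify] 0-homegfs-server: disconnecting connection from gfs01b.corvidtec.com-4175-2015/02/02-16:44:31:179119-homegfs-client-2-0-1
EOF
cat > /tmp/glustershd.log <<'EOF'
[2015-02-03 20:55:48.001797] I [afr-self-heal-entry.c:554:afr_selfheal_entry_do] 0-homegfs-replicate-0: performing entry selfheal on 6c79a368-edaa-432b-bef9-ec690ab42448
EOF

# Tag each event type, then merge-sort on the "[YYYY-MM-DD hh:mm:ss" stamp;
# a HEAL appearing a minute or two after a DISCONNECT supports the theory.
{ grep -h 'disconnecting connection' /tmp/brick.log      | sed 's/^/DISCONNECT /'
  grep -h 'selfheal on'              /tmp/glustershd.log | sed 's/^/HEAL /'
} | sort -t'[' -k2 > /tmp/events.sorted
cat /tmp/events.sorted
```

With the sample data this prints the disconnect first, then the heal ninety seconds later, which is exactly the pattern Ben describes.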
Re: [Gluster-devel] [Gluster-users] missing files
Isn't rsync what geo-rep uses?

David (Sent from mobile)

===
David F. Robinson, Ph.D.
President - Corvid Technologies
704.799.6944 x101 [office]
704.252.1310 [cell]
704.799.7974 [fax]
david.robin...@corvidtec.com
http://www.corvidtechnologies.com

> On Feb 5, 2015, at 5:41 PM, Ben Turner wrote:
>
> - Original Message -
>> From: "Ben Turner"
>> To: "David F. Robinson"
>> Cc: "Pranith Kumar Karampuri" , "Xavier Hernandez" , "Benjamin Turner" , gluster-us...@gluster.org, "Gluster Devel"
>> Sent: Thursday, February 5, 2015 5:22:26 PM
>> Subject: Re: [Gluster-users] [Gluster-devel] missing files
>>
>> - Original Message -
>>> From: "David F. Robinson"
>>> To: "Ben Turner"
>>> Cc: "Pranith Kumar Karampuri" , "Xavier Hernandez" , "Benjamin Turner" , gluster-us...@gluster.org, "Gluster Devel"
>>> Sent: Thursday, February 5, 2015 5:01:13 PM
>>> Subject: Re: [Gluster-users] [Gluster-devel] missing files
>>>
>>> I'll send you the emails I sent Pranith with the logs. What causes these disconnects?
>>
>> Thanks David! Disconnects happen when there are interruptions in communication between peers; normally there is a ping timeout that happens. It could be anything from a flaky NW to the system being too busy to respond to the pings. My initial take is more towards the latter, as rsync is absolutely the worst use case for gluster - IIRC it writes in 4 KB blocks. I try to keep my writes at least 64 KB, as in my testing that is the smallest block size I can write with before perf starts to really drop off. I'll try something similar in the lab.
>
> Ok, I do think that the file being self healed is the RCA for what you were seeing. Lets look at one of the disconnects:
>
> data-brick02a-homegfs.log:[2015-02-03 20:54:02.772180] I [server.c:518:server_rpc_notify] 0-homegfs-server: disconnecting connection from gfs01b.corvidtec.com-4175-2015/02/02-16:44:31:179119-homegfs-client-2-0-1
>
> And in the glustershd.log from the gfs01b_glustershd.log file:
>
> [2015-02-03 20:55:48.001797] I [afr-self-heal-entry.c:554:afr_selfheal_entry_do] 0-homegfs-replicate-0: performing entry selfheal on 6c79a368-edaa-432b-bef9-ec690ab42448
> [2015-02-03 20:55:49.341996] I [afr-self-heal-common.c:476:afr_log_selfheal] 0-homegfs-replicate-0: Completed entry selfheal on 6c79a368-edaa-432b-bef9-ec690ab42448. source=1 sinks=0
> [2015-02-03 20:55:49.343093] I [afr-self-heal-entry.c:554:afr_selfheal_entry_do] 0-homegfs-replicate-0: performing entry selfheal on 792cb0d6-9290-4447-8cd7-2b2d7a116a69
> [2015-02-03 20:55:50.463652] I [afr-self-heal-common.c:476:afr_log_selfheal] 0-homegfs-replicate-0: Completed entry selfheal on 792cb0d6-9290-4447-8cd7-2b2d7a116a69. source=1 sinks=0
> [2015-02-03 20:55:51.465289] I [afr-self-heal-metadata.c:54:__afr_selfheal_metadata_do] 0-homegfs-replicate-0: performing metadata selfheal on 403e661a-1c27-4e79-9867-c0572aba2b3c
> [2015-02-03 20:55:51.466515] I [afr-self-heal-common.c:476:afr_log_selfheal] 0-homegfs-replicate-0: Completed metadata selfheal on 403e661a-1c27-4e79-9867-c0572aba2b3c. source=1 sinks=0
> [2015-02-03 20:55:51.467098] I [afr-self-heal-entry.c:554:afr_selfheal_entry_do] 0-homegfs-replicate-0: performing entry selfheal on 403e661a-1c27-4e79-9867-c0572aba2b3c
> [2015-02-03 20:55:55.257808] I [afr-self-heal-common.c:476:afr_log_selfheal] 0-homegfs-replicate-0: Completed entry selfheal on 403e661a-1c27-4e79-9867-c0572aba2b3c. source=1 sinks=0
> [2015-02-03 20:55:55.258548] I [afr-self-heal-metadata.c:54:__afr_selfheal_metadata_do] 0-homegfs-replicate-0: performing metadata selfheal on c612ee2f-2fb4-4157-a9ab-5a2d5603c541
> [2015-02-03 20:55:55.259367] I [afr-self-heal-common.c:476:afr_log_selfheal] 0-homegfs-replicate-0: Completed metadata selfheal on c612ee2f-2fb4-4157-a9ab-5a2d5603c541. source=1 sinks=0
> [2015-02-03 20:55:55.259980] I [afr-self-heal-entry.c:554:afr_selfheal_entry_do] 0-homegfs-replicate-0: performing entry selfheal on c612ee2f-2fb4-4157-a9ab-5a2d5603c541
>
> As you can see the self heal logs are just spammed with files being healed, and I looked at a couple of disconnects and I see self heals getting run shortly after on the bricks that were down. Now we need to find the cause of the disconnects; I am thinking once the disconnects are resolved the files should be properly copied over without SH having to fix things. Like I said, I'll give this a go on my lab systems and see if I can repro the disconnects; I'll have time to run through it tomorrow. If in the meantime anyone else has a theory / anything to add here it would be appreciated.
>
> -b
>
>> -b
>>
>>> David (Sent from mobile)
>>>
>>> ===
>>> David F. Robinson, Ph.D.
>>> President - Corvid Technologies
>>> 704.799.6944 x101 [office]
>>> 704.252.1310 [cell]
>>> 704.799.7974
Re: [Gluster-devel] [Gluster-users] missing files
Out of curiosity, are you using --inplace?

On 02/05/2015 02:59 PM, David F. Robinson wrote:
> Should I run my rsync with --block-size = something other than the default? Is there an optimal value? I think 128k is the max from my quick search. Didn't dig into it thoroughly though.
>
> David (Sent from mobile)
Re: [Gluster-devel] [Gluster-users] missing files
Should I run my rsync with --block-size = something other than the default? Is there an optimal value? I think 128k is the max from my quick search. Didn't dig into it thoroughly though.

David (Sent from mobile)

===
David F. Robinson, Ph.D.
President - Corvid Technologies
704.799.6944 x101 [office]
704.252.1310 [cell]
704.799.7974 [fax]
david.robin...@corvidtec.com
http://www.corvidtechnologies.com

> On Feb 5, 2015, at 5:41 PM, Ben Turner wrote:
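A quick way to see the write-size effect Ben describes (4 KB writes versus his suggested 64 KB minimum) is to write the same 10 MB with dd at both block sizes and compare throughput. TARGET is an assumption: point it at a gluster fuse mount to measure the real difference; /tmp is used here only so the commands run anywhere.

```shell
# Compare 4 KB vs 64 KB client write sizes for the same 10 MB of data.
# conv=fsync forces the data out so the timing reflects actual writes.
TARGET=${TARGET:-/tmp}
dd if=/dev/zero of="$TARGET/dd-4k"  bs=4k  count=2560 conv=fsync 2>&1 | tail -n1
dd if=/dev/zero of="$TARGET/dd-64k" bs=64k count=160  conv=fsync 2>&1 | tail -n1
```

On a local filesystem the two rates are close; on a fuse mount each write is a round trip, which is why the small-block case falls off.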
Re: [Gluster-devel] [Gluster-users] missing files
It was a mix of files from very small to very large. And many terabytes of data. Approx 20tb

David (Sent from mobile)

===
David F. Robinson, Ph.D.
President - Corvid Technologies
704.799.6944 x101 [office]
704.252.1310 [cell]
704.799.7974 [fax]
david.robin...@corvidtec.com
http://www.corvidtechnologies.com

> On Feb 5, 2015, at 4:55 PM, Ben Turner wrote:
Re: [Gluster-devel] [Gluster-users] missing files
I'll send you the emails I sent Pranith with the logs. What causes these disconnects?

David (Sent from mobile)

===
David F. Robinson, Ph.D.
President - Corvid Technologies
704.799.6944 x101 [office]
704.252.1310 [cell]
704.799.7974 [fax]
david.robin...@corvidtec.com
http://www.corvidtechnologies.com

> On Feb 5, 2015, at 4:55 PM, Ben Turner wrote:
>
> - Original Message -
>> From: "Pranith Kumar Karampuri"
>> To: "Xavier Hernandez" , "David F. Robinson" , "Benjamin Turner"
>> Cc: gluster-us...@gluster.org, "Gluster Devel"
>> Sent: Thursday, February 5, 2015 5:30:04 AM
>> Subject: Re: [Gluster-users] [Gluster-devel] missing files
>>
>>> On 02/05/2015 03:48 PM, Pranith Kumar Karampuri wrote:
>>> I believe David already fixed this. I hope this is the same issue he told us about, the permissions issue.
>> Oops, it is not. I will take a look.
>
> Yes David, exactly like these:
>
> data-brick02a-homegfs.log:[2015-02-03 19:09:34.568842] I [server.c:518:server_rpc_notify] 0-homegfs-server: disconnecting connection from gfs02a.corvidtec.com-18563-2015/02/03-19:07:58:519134-homegfs-client-2-0-0
> data-brick02a-homegfs.log:[2015-02-03 19:09:41.286551] I [server.c:518:server_rpc_notify] 0-homegfs-server: disconnecting connection from gfs01a.corvidtec.com-12804-2015/02/03-19:09:38:497808-homegfs-client-2-0-0
> data-brick02a-homegfs.log:[2015-02-03 19:16:35.906412] I [server.c:518:server_rpc_notify] 0-homegfs-server: disconnecting connection from gfs02b.corvidtec.com-27190-2015/02/03-19:15:53:458467-homegfs-client-2-0-0
> data-brick02a-homegfs.log:[2015-02-03 19:51:22.761293] I [server.c:518:server_rpc_notify] 0-homegfs-server: disconnecting connection from gfs01a.corvidtec.com-25926-2015/02/03-19:51:02:89070-homegfs-client-2-0-0
> data-brick02a-homegfs.log:[2015-02-03 20:54:02.772180] I [server.c:518:server_rpc_notify] 0-homegfs-server: disconnecting connection from gfs01b.corvidtec.com-4175-2015/02/02-16:44:31:179119-homegfs-client-2-0-1
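An aside, not from the thread: the brick-flap scenario discussed above (files written to one subvolume while the other was down, then repaired by self-heal) can usually be confirmed directly from the CLI. These are stock gluster commands run against the volume name used in this thread; they need a live cluster, so they are shown here for reference only:

```shell
# Files recorded as needing heal on each brick; non-empty output right
# after an rsync suggests one replica missed writes during a disconnect.
gluster volume heal homegfs info

# Entries that ended up in split-brain and need manual resolution.
gluster volume heal homegfs info split-brain

# Confirm every brick process is online (a flapping brick shows up here).
gluster volume status homegfs
```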
Re: [Gluster-devel] [Gluster-users] missing files
On 02/05/2015 03:48 PM, Pranith Kumar Karampuri wrote:
I believe David already fixed this. I hope this is the same issue he told us about, the permissions issue.

Oops, it is not. I will take a look.

Pranith

On 02/05/2015 03:44 PM, Xavier Hernandez wrote:
Is the failure repeatable? With the same directories?

It's very weird that the directories appear on the volume when you do an 'ls' on the bricks. Could it be that you only did a single 'ls' on the fuse mount, which did not show the directory?

Is it possible that this 'ls' triggered a self-heal that repaired the problem, whatever it was, and when you did another 'ls' on the fuse mount after the 'ls' on the bricks, the directories were there? The first 'ls' could have healed the files, causing the following 'ls' on the bricks to show the files as if nothing were damaged. If that's the case, it's possible that there were some disconnections during the copy.

Added Pranith because he knows the replication and self-heal details better.

Xavi

On 02/04/2015 07:23 PM, David F.
Robinson wrote:
Distributed/replicated

Volume Name: homegfs
Type: Distributed-Replicate
Volume ID: 1e32672a-f1b7-4b58-ba94-58c085e59071
Status: Started
Number of Bricks: 4 x 2 = 8
Transport-type: tcp
Bricks:
Brick1: gfsib01a.corvidtec.com:/data/brick01a/homegfs
Brick2: gfsib01b.corvidtec.com:/data/brick01b/homegfs
Brick3: gfsib01a.corvidtec.com:/data/brick02a/homegfs
Brick4: gfsib01b.corvidtec.com:/data/brick02b/homegfs
Brick5: gfsib02a.corvidtec.com:/data/brick01a/homegfs
Brick6: gfsib02b.corvidtec.com:/data/brick01b/homegfs
Brick7: gfsib02a.corvidtec.com:/data/brick02a/homegfs
Brick8: gfsib02b.corvidtec.com:/data/brick02b/homegfs
Options Reconfigured:
performance.io-thread-count: 32
performance.cache-size: 128MB
performance.write-behind-window-size: 128MB
server.allow-insecure: on
network.ping-timeout: 10
storage.owner-gid: 100
geo-replication.indexing: off
geo-replication.ignore-pid-check: on
changelog.changelog: on
changelog.fsync-interval: 3
changelog.rollover-time: 15
server.manage-gids: on

-- Original Message --
From: "Xavier Hernandez"
To: "David F. Robinson" ; "Benjamin Turner"
Cc: "gluster-us...@gluster.org" ; "Gluster Devel"
Sent: 2/4/2015 6:03:45 AM
Subject: Re: [Gluster-devel] missing files

On 02/04/2015 01:30 AM, David F. Robinson wrote:
Sorry. Thought about this a little more. I should have been clearer. The files were on both bricks of the replica, not just one side. So both bricks had to have been up... The files/directories just don't show up on the mount.

I was reading and saw a related bug (https://bugzilla.redhat.com/show_bug.cgi?id=1159484). I saw it suggested to run:

find -d -exec getfattr -h -n trusted.ec.heal {} \;

This command is specific to a dispersed volume. It won't do anything (aside from the error you are seeing) on a replicated volume.

I think you are using a replicated volume, right? In that case I'm not sure what can be happening. Is your volume a pure replicated one or a distributed-replicated?
On a pure replicated volume it doesn't make sense that some entries do not show in an 'ls' when the file is in both replicas (at least without any error message in the logs). On a distributed-replicated volume it could be caused by some problem while combining the contents of each replica set. What's the configuration of your volume?

Xavi

I get a bunch of errors for "Operation not supported":

[root@gfs02a homegfs]# find wks_backup -d -exec getfattr -h -n trusted.ec.heal {} \;
find: warning: the -d option is deprecated; please use -depth instead, because the latter is a POSIX-compliant feature.
wks_backup/homer_backup/backup: trusted.ec.heal: Operation not supported
wks_backup/homer_backup/logs/2014_05_20.log: trusted.ec.heal: Operation not supported
wks_backup/homer_backup/logs/2014_05_21.log: trusted.ec.heal: Operation not supported
wks_backup/homer_backup/logs/2014_05_18.log: trusted.ec.heal: Operation not supported
wks_backup/homer_backup/logs/2014_05_19.log: trusted.ec.heal: Operation not supported
wks_backup/homer_backup/logs/2014_05_22.log: trusted.ec.heal: Operation not supported
wks_backup/homer_backup/logs: trusted.ec.heal: Operation not supported
wks_backup/homer_backup: trusted.ec.heal: Operation not supported

-- Original Message --
From: "Benjamin Turner" <bennytu...@gmail.com>
To: "David F. Robinson" <david.robin...@corvidtec.com>
Cc: "Gluster Devel" <gluster-devel@gluster.org>; "gluster-us...@gluster.org" <gluster-us...@gluster.org>
Sent: 2/3/2015 7:12:34 PM
Subject: Re: [Gluster-devel] missing files

It sounds to me like the files were only copied to one replica, weren't there for the initial ls, which triggered a self heal, and were there for the last ls because they were healed. Is there any chance that one of the replicas was down during the rsync? It could be that you lost a brick during the copy or something like that. To confirm I would lo