Re: [Gluster-devel] [Gluster-users] missing files

2015-02-08 Thread Niels de Vos
On Sun, Feb 08, 2015 at 01:43:55PM +, Justin Clift wrote:
> On 6 Feb 2015, at 20:33, Ben Turner  wrote:
> > - Original Message -
> >> From: "Justin Clift" 
> >> To: "Benjamin Turner" 
> >> Cc: "David F. Robinson" , 
> >> gluster-us...@gluster.org, "Gluster Devel"
> >> , "Ben Turner" 
> >> Sent: Friday, February 6, 2015 3:27:53 PM
> >> Subject: Re: [Gluster-devel] [Gluster-users]  missing files
> >> 
> >> On 6 Feb 2015, at 02:05, Benjamin Turner  wrote:
> >>> I think that the multi threaded epoll changes that _just_ landed in master
> >>> will help resolve this, but they are so new I haven't been able to test
> >>> this.  I'll know more when I get a chance to test tomorrow.
> >> 
> >> Which multi-threaded epoll code just landed in master?  Are you thinking
> >> of this one?
> >> 
> >>  http://review.gluster.org/#/c/3842/
> >> 
> >> If so, it's not in master yet. ;)
> > 
> > Doh!  I just saw - "Required patches are all upstream now" and assumed they 
> > were merged.  I have been in class all week so I am not up2date with 
> > everything.  I gave instructions on compiling it from the gerrit patches + 
> > master so if David wants to give it a go he can.  Sorry for the confusion.
> 
> Vijay merged the code into master yesterday, so it shouldn't be too long until we
> can get some RPMs created for people to test with (easily). :)

Nightly builds that include this change are already available:

http://download.gluster.org/pub/gluster/glusterfs/nightly/glusterfs/
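
If you want to try them on an RPM-based test box, something along these lines
should work (the exact directory layout and package names under that URL may
differ, so check it in a browser first):

    # pull the nightly RPMs for your distro into an empty dir, then install
    wget -r -np -nd -A '*.rpm' \
        http://download.gluster.org/pub/gluster/glusterfs/nightly/glusterfs/
    yum localinstall glusterfs*.rpm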

Niels




Re: [Gluster-devel] [Gluster-users] missing files

2015-02-08 Thread Ben Turner
- Original Message -
> From: "Ben Turner" 
> To: "David F. Robinson" 
> Cc: "Pranith Kumar Karampuri" , "Xavier Hernandez" 
> , "Benjamin Turner"
> , gluster-us...@gluster.org, "Gluster Devel" 
> 
> Sent: Thursday, February 5, 2015 5:22:26 PM
> Subject: Re: [Gluster-users] [Gluster-devel] missing files
> 
> - Original Message -
> > From: "David F. Robinson" 
> > To: "Ben Turner" 
> > Cc: "Pranith Kumar Karampuri" , "Xavier Hernandez"
> > , "Benjamin Turner"
> > , gluster-us...@gluster.org, "Gluster Devel"
> > 
> > Sent: Thursday, February 5, 2015 5:01:13 PM
> > Subject: Re: [Gluster-users] [Gluster-devel] missing files
> > 
> > I'll send you the emails I sent Pranith with the logs. What causes these
> > disconnects?
> 
> Thanks David!  Disconnects happen when there are interruptions in
> communication between peers; normally a ping timeout is what happens.
> It could be anything from a flaky NW to the system being too busy to respond
> to the pings.  My initial take leans more towards the latter, as rsync is
> absolutely the worst use case for gluster - IIRC it writes in 4kb blocks.  I
> try to keep my writes at least 64KB, as in my testing that is the smallest
> block size I can write with before perf starts to really drop off.  I'll try
> something similar in the lab.

OK, I do think that the files being self-healed is the RCA for what you were seeing.  
Let's look at one of the disconnects:

data-brick02a-homegfs.log:[2015-02-03 20:54:02.772180] I 
[server.c:518:server_rpc_notify] 0-homegfs-server: disconnecting connection 
from gfs01b.corvidtec.com-4175-2015/02/02-16:44:31:179119-homegfs-client-2-0-1

And in the glustershd.log from the gfs01b_glustershd.log file:

[2015-02-03 20:55:48.001797] I 
[afr-self-heal-entry.c:554:afr_selfheal_entry_do] 0-homegfs-replicate-0: 
performing entry selfheal on 6c79a368-edaa-432b-bef9-ec690ab42448
[2015-02-03 20:55:49.341996] I [afr-self-heal-common.c:476:afr_log_selfheal] 
0-homegfs-replicate-0: Completed entry selfheal on 
6c79a368-edaa-432b-bef9-ec690ab42448. source=1 sinks=0 
[2015-02-03 20:55:49.343093] I 
[afr-self-heal-entry.c:554:afr_selfheal_entry_do] 0-homegfs-replicate-0: 
performing entry selfheal on 792cb0d6-9290-4447-8cd7-2b2d7a116a69
[2015-02-03 20:55:50.463652] I [afr-self-heal-common.c:476:afr_log_selfheal] 
0-homegfs-replicate-0: Completed entry selfheal on 
792cb0d6-9290-4447-8cd7-2b2d7a116a69. source=1 sinks=0 
[2015-02-03 20:55:51.465289] I 
[afr-self-heal-metadata.c:54:__afr_selfheal_metadata_do] 0-homegfs-replicate-0: 
performing metadata selfheal on 403e661a-1c27-4e79-9867-c0572aba2b3c
[2015-02-03 20:55:51.466515] I [afr-self-heal-common.c:476:afr_log_selfheal] 
0-homegfs-replicate-0: Completed metadata selfheal on 
403e661a-1c27-4e79-9867-c0572aba2b3c. source=1 sinks=0 
[2015-02-03 20:55:51.467098] I 
[afr-self-heal-entry.c:554:afr_selfheal_entry_do] 0-homegfs-replicate-0: 
performing entry selfheal on 403e661a-1c27-4e79-9867-c0572aba2b3c
[2015-02-03 20:55:55.257808] I [afr-self-heal-common.c:476:afr_log_selfheal] 
0-homegfs-replicate-0: Completed entry selfheal on 
403e661a-1c27-4e79-9867-c0572aba2b3c. source=1 sinks=0 
[2015-02-03 20:55:55.258548] I 
[afr-self-heal-metadata.c:54:__afr_selfheal_metadata_do] 0-homegfs-replicate-0: 
performing metadata selfheal on c612ee2f-2fb4-4157-a9ab-5a2d5603c541
[2015-02-03 20:55:55.259367] I [afr-self-heal-common.c:476:afr_log_selfheal] 
0-homegfs-replicate-0: Completed metadata selfheal on 
c612ee2f-2fb4-4157-a9ab-5a2d5603c541. source=1 sinks=0 
[2015-02-03 20:55:55.259980] I 
[afr-self-heal-entry.c:554:afr_selfheal_entry_do] 0-homegfs-replicate-0: 
performing entry selfheal on c612ee2f-2fb4-4157-a9ab-5a2d5603c541

As you can see, the self-heal logs are just spammed with files being healed, and 
for the couple of disconnects I looked at I see self-heals getting run shortly 
after on the bricks that were down.  Now we need to find the cause of the 
disconnects; I am thinking that once the disconnects are resolved the files should 
be properly copied over without SH having to fix things.  Like I said, I'll give 
this a go on my lab systems and see if I can repro the disconnects; I'll have 
time to run through it tomorrow.  If in the meantime anyone else has a theory 
/ anything to add here, it would be appreciated.
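
For the repro I'm planning something simple along these lines (paths and counts 
are just placeholders, not what you ran):

    # create a pile of small files and rsync them onto the fuse mount while
    # watching the brick logs for disconnects
    mkdir -p /tmp/smallfiles
    for i in $(seq 1 100000); do
        dd if=/dev/zero of=/tmp/smallfiles/f$i bs=4k count=1 2>/dev/null
    done
    rsync -av /tmp/smallfiles/ /mnt/homegfs/rsync-test/ &
    tail -f /var/log/glusterfs/bricks/*.log | grep -i 'disconnecting connection'

If the disconnects show up under that kind of load it should be easy to narrow 
down from there.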

-b
 
> -b
>  
> > David  (Sent from mobile)
> > 
> > ===
> > David F. Robinson, Ph.D.
> > President - Corvid Technologies
> > 704.799.6944 x101 [office]
> > 704.252.1310  [cell]
> > 704.799.7974  [fax]
> > david.robin...@corvidtec.com
> > http://www.corvidtechnologies.com
> > 
> > > On Feb 5, 2015, at 4:55 PM, Ben Turner  wrote:
> > > 
> > > - Original Message -
> > >> From: "Pranith Kumar Karampuri" 
> > >> To: "Xavier Hernandez" , "David F. Robinson"
> > >> , "Benjamin Turner"
> > >> 
> > >> Cc: gluster-us...@gluster.org, "Gluster Devel"
> > >> 
> > >> Sent: Thursday, February 5, 2015 5:30:04 AM
> > >> Subject: Re: [Gluster-users] [Gluster-devel] missin

Re: [Gluster-devel] [Gluster-users] missing files

2015-02-08 Thread Ben Turner
- Original Message -
> From: "Justin Clift" 
> To: "Benjamin Turner" 
> Cc: "David F. Robinson" , 
> gluster-us...@gluster.org, "Gluster Devel"
> , "Ben Turner" 
> Sent: Friday, February 6, 2015 3:27:53 PM
> Subject: Re: [Gluster-devel] [Gluster-users]  missing files
> 
> On 6 Feb 2015, at 02:05, Benjamin Turner  wrote:
> > I think that the multi threaded epoll changes that _just_ landed in master
> > will help resolve this, but they are so new I haven't been able to test
> > this.  I'll know more when I get a chance to test tomorrow.
> 
> Which multi-threaded epoll code just landed in master?  Are you thinking
> of this one?
> 
>   http://review.gluster.org/#/c/3842/
> 
> If so, it's not in master yet. ;)

Doh!  I just saw - "Required patches are all upstream now" and assumed they 
were merged.  I have been in class all week so I am not up2date with 
everything.  I gave instructions on compiling it from the gerrit patches + 
master so if David wants to give it a go he can.  Sorry for the confusion.
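
Roughly, building a gerrit change on top of master looks like this (the 
patch-set number below is just an example - take whatever the Download link on 
the review page shows):

    git clone https://github.com/gluster/glusterfs.git && cd glusterfs
    git fetch http://review.gluster.org/glusterfs refs/changes/42/3842/1
    git cherry-pick FETCH_HEAD
    ./autogen.sh && ./configure && make && sudo make install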

-b
 
> + Justin
> 
> 
> > -b
> > 
> > On Thu, Feb 5, 2015 at 6:04 PM, David F. Robinson
> >  wrote:
> > Isn't rsync what geo-rep uses?
> > 
> > David  (Sent from mobile)
> > 
> > ===
> > David F. Robinson, Ph.D.
> > President - Corvid Technologies
> > 704.799.6944 x101 [office]
> > 704.252.1310  [cell]
> > 704.799.7974  [fax]
> > david.robin...@corvidtec.com
> > http://www.corvidtechnologies.com
> > 
> > > On Feb 5, 2015, at 5:41 PM, Ben Turner  wrote:
> > >
> > > - Original Message -
> > >> From: "Ben Turner" 
> > >> To: "David F. Robinson" 
> > >> Cc: "Pranith Kumar Karampuri" , "Xavier Hernandez"
> > >> , "Benjamin Turner"
> > >> , gluster-us...@gluster.org, "Gluster Devel"
> > >> 
> > >> Sent: Thursday, February 5, 2015 5:22:26 PM
> > >> Subject: Re: [Gluster-users] [Gluster-devel] missing files
> > >>
> > >> - Original Message -
> > >>> From: "David F. Robinson" 
> > >>> To: "Ben Turner" 
> > >>> Cc: "Pranith Kumar Karampuri" , "Xavier Hernandez"
> > >>> , "Benjamin Turner"
> > >>> , gluster-us...@gluster.org, "Gluster Devel"
> > >>> 
> > >>> Sent: Thursday, February 5, 2015 5:01:13 PM
> > >>> Subject: Re: [Gluster-users] [Gluster-devel] missing files
> > >>>
> > >>> I'll send you the emails I sent Pranith with the logs. What causes
> > >>> these
> > >>> disconnects?
> > >>
> > >> Thanks David!  Disconnects happen when there are interruptions in
> > >> communication between peers; normally a ping timeout is what happens.
> > >> It could be anything from a flaky NW to the system being too busy to
> > >> respond to the pings.  My initial take leans more towards the latter, as
> > >> rsync is absolutely the worst use case for gluster - IIRC it writes in
> > >> 4kb blocks.  I try to keep my writes at least 64KB, as in my testing
> > >> that is the smallest block size I can write with before perf starts to
> > >> really drop off.  I'll try something similar in the lab.
> > >
> > > OK, I do think that the files being self-healed is the RCA for what you
> > > were seeing.  Let's look at one of the disconnects:
> > >
> > > data-brick02a-homegfs.log:[2015-02-03 20:54:02.772180] I
> > > [server.c:518:server_rpc_notify] 0-homegfs-server: disconnecting
> > > connection from
> > > gfs01b.corvidtec.com-4175-2015/02/02-16:44:31:179119-homegfs-client-2-0-1
> > >
> > > And in the glustershd.log from the gfs01b_glustershd.log file:
> > >
> > > [2015-02-03 20:55:48.001797] I
> > > [afr-self-heal-entry.c:554:afr_selfheal_entry_do] 0-homegfs-replicate-0:
> > > performing entry selfheal on 6c79a368-edaa-432b-bef9-ec690ab42448
> > > [2015-02-03 20:55:49.341996] I
> > > [afr-self-heal-common.c:476:afr_log_selfheal] 0-homegfs-replicate-0:
> > > Completed entry selfheal on 6c79a368-edaa-432b-bef9-ec690ab42448.
> > > source=1 sinks=0
> > > [2015-02-03 20:55:49.343093] I
> > > [afr-self-heal-entry.c:55

Re: [Gluster-devel] [Gluster-users] missing files

2015-02-08 Thread Ben Turner
- Original Message -
> From: "Pranith Kumar Karampuri" 
> To: "Xavier Hernandez" , "David F. Robinson" 
> , "Benjamin Turner"
> 
> Cc: gluster-us...@gluster.org, "Gluster Devel" 
> Sent: Thursday, February 5, 2015 5:30:04 AM
> Subject: Re: [Gluster-users] [Gluster-devel] missing files
> 
> 
> On 02/05/2015 03:48 PM, Pranith Kumar Karampuri wrote:
> > I believe David already fixed this. I hope this is the same issue he
> > told about permissions issue.
> Oops, it is not. I will take a look.

Yes David, exactly like these:

data-brick02a-homegfs.log:[2015-02-03 19:09:34.568842] I 
[server.c:518:server_rpc_notify] 0-homegfs-server: disconnecting connection 
from gfs02a.corvidtec.com-18563-2015/02/03-19:07:58:519134-homegfs-client-2-0-0
data-brick02a-homegfs.log:[2015-02-03 19:09:41.286551] I 
[server.c:518:server_rpc_notify] 0-homegfs-server: disconnecting connection 
from gfs01a.corvidtec.com-12804-2015/02/03-19:09:38:497808-homegfs-client-2-0-0
data-brick02a-homegfs.log:[2015-02-03 19:16:35.906412] I 
[server.c:518:server_rpc_notify] 0-homegfs-server: disconnecting connection 
from gfs02b.corvidtec.com-27190-2015/02/03-19:15:53:458467-homegfs-client-2-0-0
data-brick02a-homegfs.log:[2015-02-03 19:51:22.761293] I 
[server.c:518:server_rpc_notify] 0-homegfs-server: disconnecting connection 
from gfs01a.corvidtec.com-25926-2015/02/03-19:51:02:89070-homegfs-client-2-0-0
data-brick02a-homegfs.log:[2015-02-03 20:54:02.772180] I 
[server.c:518:server_rpc_notify] 0-homegfs-server: disconnecting connection 
from gfs01b.corvidtec.com-4175-2015/02/02-16:44:31:179119-homegfs-client-2-0-1

You can 100% verify my theory if you can correlate the time of the disconnects 
with the time that the missing files were healed.  Can you have a look at 
/var/log/glusterfs/glustershd.log?  That has all of the healed files + 
timestamps; if we can see a disconnect during the rsync and a self-heal of the 
missing file, I think we can safely assume that the disconnects may have caused 
this.  I'll try this on my test systems - how much data did you rsync?  What 
rough file sizes / an idea of the dir layout?
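
Something quick and dirty like this (assuming the default log locations) would 
be enough to line the two up:

    # timestamps of client disconnects seen by the bricks
    grep 'disconnecting connection' /var/log/glusterfs/bricks/*.log
    # timestamps of heals completed by the self-heal daemon
    grep 'Completed' /var/log/glusterfs/glustershd.log

If the 'Completed ... selfheal' times land right after the disconnect times for 
the files you were missing, that pretty much confirms it.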

@Pranith - Could this be the cause here: bricks flapping up and down during the 
rsync left the files missing on the first ls (written to one subvol but not the 
other because it was down), the ls triggered SH, and that's why the files were 
there for the second ls?

-b

 
> Pranith
> >
> > Pranith
> > On 02/05/2015 03:44 PM, Xavier Hernandez wrote:
> >> Is the failure repeatable ? with the same directories ?
> >>
> > >> It's very weird that the directories appear on the volume when you do
> > >> an 'ls' on the bricks. Could it be that you only did a single 'ls'
> > >> on the fuse mount, which did not show the directory? Is it possible that
> > >> this 'ls' triggered a self-heal that repaired the problem, whatever
> > >> it was, and when you did another 'ls' on the fuse mount after the
> > >> 'ls' on the bricks, the directories were there?
> > >>
> > >> The first 'ls' could have healed the files, causing the
> > >> following 'ls' on the bricks to show the files as if nothing were
> > >> damaged. If that's the case, it's possible that there were some
> > >> disconnections during the copy.
> >>
> >> Added Pranith because he knows better replication and self-heal details.
> >>
> >> Xavi
> >>
> >> On 02/04/2015 07:23 PM, David F. Robinson wrote:
> >>> Distributed/replicated
> >>>
> >>> Volume Name: homegfs
> >>> Type: Distributed-Replicate
> >>> Volume ID: 1e32672a-f1b7-4b58-ba94-58c085e59071
> >>> Status: Started
> >>> Number of Bricks: 4 x 2 = 8
> >>> Transport-type: tcp
> >>> Bricks:
> >>> Brick1: gfsib01a.corvidtec.com:/data/brick01a/homegfs
> >>> Brick2: gfsib01b.corvidtec.com:/data/brick01b/homegfs
> >>> Brick3: gfsib01a.corvidtec.com:/data/brick02a/homegfs
> >>> Brick4: gfsib01b.corvidtec.com:/data/brick02b/homegfs
> >>> Brick5: gfsib02a.corvidtec.com:/data/brick01a/homegfs
> >>> Brick6: gfsib02b.corvidtec.com:/data/brick01b/homegfs
> >>> Brick7: gfsib02a.corvidtec.com:/data/brick02a/homegfs
> >>> Brick8: gfsib02b.corvidtec.com:/data/brick02b/homegfs
> >>> Options Reconfigured:
> >>> performance.io-thread-count: 32
> >>> performance.cache-size: 128MB
> >>> performance.write-behind-window-size: 128MB
> >>> server.allow-insecure: on
> >>> network.ping-timeout: 10
> >>> storage.owner-gid: 100
> >>> geo-replication.indexing: off
> >>> geo-replication.ignore-pid-check: on
> >>> changelog.changelog: on
> >>> changelog.fsync-interval: 3
> >>> changelog.rollover-time: 15
> >>> server.manage-gids: on
> >>>
> >>>
> >>> -- Original Message --
> >>> From: "Xavier Hernandez" 
> >>> To: "David F. Robinson" ; "Benjamin
> >>> Turner" 
> >>> Cc: "gluster-us...@gluster.org" ; "Gluster
> >>> Devel" 
> >>> Sent: 2/4/2015 6:03:45 AM
> >>> Subject: Re: [Gluster-devel] missing files
> >>>
>  On 02/04/2015 01:30 AM, David F. Robinson wrote:
> > Sorry. Thought about this a little more. I should have been clearer.
> > The files were on both bricks o

Re: [Gluster-devel] [Gluster-users] missing files

2015-02-08 Thread Ben Turner
- Original Message -
> From: "David F. Robinson" 
> To: "Ben Turner" 
> Cc: "Pranith Kumar Karampuri" , "Xavier Hernandez" 
> , "Benjamin Turner"
> , gluster-us...@gluster.org, "Gluster Devel" 
> 
> Sent: Thursday, February 5, 2015 5:01:13 PM
> Subject: Re: [Gluster-users] [Gluster-devel] missing files
> 
> I'll send you the emails I sent Pranith with the logs. What causes these
> disconnects?

Thanks David!  Disconnects happen when there are interruptions in communication 
between peers; normally a ping timeout is what happens.  It could be anything 
from a flaky NW to the system being too busy to respond to the pings.  My 
initial take leans more towards the latter, as rsync is absolutely the worst use 
case for gluster - IIRC it writes in 4kb blocks.  I try to keep my writes at 
least 64KB, as in my testing that is the smallest block size I can write with 
before perf starts to really drop off.  I'll try something similar in the lab.
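
If you want to see the block-size effect on your setup, a crude test like this 
on the fuse mount (path is just an example) shows it pretty clearly - each line 
writes 1GB and dd prints the throughput at the end:

    dd if=/dev/zero of=/mnt/homegfs/ddtest bs=4k  count=262144 conv=fsync
    dd if=/dev/zero of=/mnt/homegfs/ddtest bs=64k count=16384  conv=fsync
    dd if=/dev/zero of=/mnt/homegfs/ddtest bs=1M  count=1024   conv=fsync
    rm -f /mnt/homegfs/ddtest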

-b
 
> David  (Sent from mobile)
> 
> ===
> David F. Robinson, Ph.D.
> President - Corvid Technologies
> 704.799.6944 x101 [office]
> 704.252.1310  [cell]
> 704.799.7974  [fax]
> david.robin...@corvidtec.com
> http://www.corvidtechnologies.com
> 
> > On Feb 5, 2015, at 4:55 PM, Ben Turner  wrote:
> > 
> > - Original Message -
> >> From: "Pranith Kumar Karampuri" 
> >> To: "Xavier Hernandez" , "David F. Robinson"
> >> , "Benjamin Turner"
> >> 
> >> Cc: gluster-us...@gluster.org, "Gluster Devel" 
> >> Sent: Thursday, February 5, 2015 5:30:04 AM
> >> Subject: Re: [Gluster-users] [Gluster-devel] missing files
> >> 
> >> 
> >>> On 02/05/2015 03:48 PM, Pranith Kumar Karampuri wrote:
> >>> I believe David already fixed this. I hope this is the same issue he
> >>> told about permissions issue.
> >> Oops, it is not. I will take a look.
> > 
> > Yes David exactly like these:
> > 
> > data-brick02a-homegfs.log:[2015-02-03 19:09:34.568842] I
> > [server.c:518:server_rpc_notify] 0-homegfs-server: disconnecting
> > connection from
> > gfs02a.corvidtec.com-18563-2015/02/03-19:07:58:519134-homegfs-client-2-0-0
> > data-brick02a-homegfs.log:[2015-02-03 19:09:41.286551] I
> > [server.c:518:server_rpc_notify] 0-homegfs-server: disconnecting
> > connection from
> > gfs01a.corvidtec.com-12804-2015/02/03-19:09:38:497808-homegfs-client-2-0-0
> > data-brick02a-homegfs.log:[2015-02-03 19:16:35.906412] I
> > [server.c:518:server_rpc_notify] 0-homegfs-server: disconnecting
> > connection from
> > gfs02b.corvidtec.com-27190-2015/02/03-19:15:53:458467-homegfs-client-2-0-0
> > data-brick02a-homegfs.log:[2015-02-03 19:51:22.761293] I
> > [server.c:518:server_rpc_notify] 0-homegfs-server: disconnecting
> > connection from
> > gfs01a.corvidtec.com-25926-2015/02/03-19:51:02:89070-homegfs-client-2-0-0
> > data-brick02a-homegfs.log:[2015-02-03 20:54:02.772180] I
> > [server.c:518:server_rpc_notify] 0-homegfs-server: disconnecting
> > connection from
> > gfs01b.corvidtec.com-4175-2015/02/02-16:44:31:179119-homegfs-client-2-0-1
> > 
> > You can 100% verify my theory if you can correlate the time of the
> > disconnects with the time that the missing files were healed.  Can you have
> > a look at /var/log/glusterfs/glustershd.log?  That has all of the healed
> > files + timestamps; if we can see a disconnect during the rsync and a
> > self-heal of the missing file, I think we can safely assume that the
> > disconnects may have caused this.  I'll try this on my test systems - how
> > much data did you rsync?  What rough file sizes / an idea of the dir layout?
> > 
> > @Pranith - Could this be the cause here: bricks flapping up and down during
> > the rsync left the files missing on the first ls (written to one subvol but
> > not the other because it was down), the ls triggered SH, and that's why the
> > files were there for the second ls?
> > 
> > -b
> > 
> > 
> >> Pranith
> >>> 
> >>> Pranith
>  On 02/05/2015 03:44 PM, Xavier Hernandez wrote:
>  Is the failure repeatable ? with the same directories ?
>  
>  It's very weird that the directories appear on the volume when you do
>  an 'ls' on the bricks. Could it be that you only did a single 'ls'
>  on the fuse mount, which did not show the directory? Is it possible that
>  this 'ls' triggered a self-heal that repaired the problem, whatever
>  it was, and when you did another 'ls' on the fuse mount after the
>  'ls' on the bricks, the directories were there?
>  
>  The first 'ls' could have healed the files, causing the
>  following 'ls' on the bricks to show the files as if nothing were
>  damaged. If that's the case, it's possible that there were some
>  disconnections during the copy.
>  
>  Added Pranith because he knows better replication and self-heal details.
>  
>  Xavi
>  
> > On 02/04/2015 07:23 PM, David F. Robinson wrote:
> > Distributed/replicated
> > 
> > Volume Name: homegfs
> > Type: Distributed-Replicate
> >

Re: [Gluster-devel] [Gluster-users] missing files

2015-02-08 Thread Justin Clift
On 6 Feb 2015, at 20:33, Ben Turner  wrote:
> - Original Message -
>> From: "Justin Clift" 
>> To: "Benjamin Turner" 
>> Cc: "David F. Robinson" , 
>> gluster-us...@gluster.org, "Gluster Devel"
>> , "Ben Turner" 
>> Sent: Friday, February 6, 2015 3:27:53 PM
>> Subject: Re: [Gluster-devel] [Gluster-users]  missing files
>> 
>> On 6 Feb 2015, at 02:05, Benjamin Turner  wrote:
>>> I think that the multi threaded epoll changes that _just_ landed in master
>>> will help resolve this, but they are so new I haven't been able to test
>>> this.  I'll know more when I get a chance to test tomorrow.
>> 
>> Which multi-threaded epoll code just landed in master?  Are you thinking
>> of this one?
>> 
>>  http://review.gluster.org/#/c/3842/
>> 
>> If so, it's not in master yet. ;)
> 
> Doh!  I just saw - "Required patches are all upstream now" and assumed they 
> were merged.  I have been in class all week so I am not up2date with 
> everything.  I gave instructions on compiling it from the gerrit patches + 
> master so if David wants to give it a go he can.  Sorry for the confusion.

Vijay merged the code into master yesterday, so it shouldn't be too long until we
can get some RPMs created for people to test with (easily). :)

+ Justin

--
GlusterFS - http://www.gluster.org

An open source, distributed file system scaling to several
petabytes, and handling thousands of clients.

My personal twitter: twitter.com/realjustinclift



Re: [Gluster-devel] [Gluster-users] missing files

2015-02-06 Thread David F. Robinson
I don't think I understood what you sent enough to give it a try.  I'll 
wait until it comes out in a beta or release version.


David


-- Original Message --
From: "Ben Turner" 
To: "Justin Clift" ; "David F. Robinson" 

Cc: "Benjamin Turner" ; gluster-us...@gluster.org; 
"Gluster Devel" 

Sent: 2/6/2015 3:33:42 PM
Subject: Re: [Gluster-devel] [Gluster-users] missing files


- Original Message -

 From: "Justin Clift" 
 To: "Benjamin Turner" 
 Cc: "David F. Robinson" , 
gluster-us...@gluster.org, "Gluster Devel"

 , "Ben Turner" 
 Sent: Friday, February 6, 2015 3:27:53 PM
 Subject: Re: [Gluster-devel] [Gluster-users] missing files

 On 6 Feb 2015, at 02:05, Benjamin Turner  
wrote:
 > I think that the multi threaded epoll changes that _just_ landed in 
master
 > will help resolve this, but they are so new I haven't been able to 
test

 > this. I'll know more when I get a chance to test tomorrow.

 Which multi-threaded epoll code just landed in master? Are you 
thinking

 of this one?

   http://review.gluster.org/#/c/3842/

 If so, it's not in master yet. ;)


Doh! I just saw - "Required patches are all upstream now" and assumed 
they were merged. I have been in class all week so I am not up2date 
with everything. I gave instructions on compiling it from the gerrit 
patches + master so if David wants to give it a go he can. Sorry for 
the confusion.


-b


 + Justin


 > -b
 >
 > On Thu, Feb 5, 2015 at 6:04 PM, David F. Robinson
 >  wrote:
 > Isn't rsync what geo-rep uses?
 >
 > David (Sent from mobile)
 >
 > ===
 > David F. Robinson, Ph.D.
 > President - Corvid Technologies
 > 704.799.6944 x101 [office]
 > 704.252.1310 [cell]
 > 704.799.7974 [fax]
 > david.robin...@corvidtec.com
 > http://www.corvidtechnologies.com
 >
 > > On Feb 5, 2015, at 5:41 PM, Ben Turner  
wrote:

 > >
 > > - Original Message -
 > >> From: "Ben Turner" 
 > >> To: "David F. Robinson" 
 > >> Cc: "Pranith Kumar Karampuri" , "Xavier 
Hernandez"

 > >> , "Benjamin Turner"
 > >> , gluster-us...@gluster.org, "Gluster 
Devel"

 > >> 
 > >> Sent: Thursday, February 5, 2015 5:22:26 PM
 > >> Subject: Re: [Gluster-users] [Gluster-devel] missing files
 > >>
 > >> - Original Message -
 > >>> From: "David F. Robinson" 
 > >>> To: "Ben Turner" 
 > >>> Cc: "Pranith Kumar Karampuri" , "Xavier 
Hernandez"

 > >>> , "Benjamin Turner"
 > >>> , gluster-us...@gluster.org, "Gluster 
Devel"

 > >>> 
 > >>> Sent: Thursday, February 5, 2015 5:01:13 PM
 > >>> Subject: Re: [Gluster-users] [Gluster-devel] missing files
 > >>>
 > >>> I'll send you the emails I sent Pranith with the logs. What 
causes

 > >>> these
 > >>> disconnects?
 > >>
 > >> Thanks David! Disconnects happen when there are interruptions in
 > >> communication between peers; normally a ping timeout is what happens.
 > >> It could be anything from a flaky NW to the system being too busy to
 > >> respond to the pings. My initial take leans more towards the latter, as
 > >> rsync is absolutely the worst use case for gluster - IIRC it writes in
 > >> 4kb blocks. I try to keep my writes at least 64KB, as in my testing that
 > >> is the smallest block size I can write with before perf starts to really
 > >> drop off. I'll try something similar in the lab.
 > >
 > > OK, I do think that the files being self-healed is the RCA for what you
 > > were seeing. Let's look at one of the disconnects:
 > >
 > > data-brick02a-homegfs.log:[2015-02-03 20:54:02.772180] I
 > > [server.c:518:server_rpc_notify] 0-homegfs-server: disconnecting
 > > connection from
 > > 
gfs01b.corvidtec.com-4175-2015/02/02-16:44:31:179119-homegfs-client-2-0-1

 > >
 > > And in the glustershd.log from the gfs01b_glustershd.log file:
 > >
 > > [2015-02-03 20:55:48.001797] I
 > > [afr-self-heal-entry.c:554:afr_selfheal_entry_do] 
0-homegfs-replicate-0:

 > > performing entry selfheal on 6c79a368-edaa-432b-bef9-ec690ab42448
 > > [2015-02-03 20:55:49.341996] I
 > > [afr-self-heal-common.c:476:afr_log_selfheal] 
0-homegfs-replicate-0:

 > > Completed entry selfheal on 6c79a368-edaa-432b-bef9-ec690ab42448.
 > > source=1 sinks=0
 > > [2015-02-

Re: [Gluster-devel] [Gluster-users] missing files

2015-02-06 Thread Justin Clift
On 6 Feb 2015, at 02:05, Benjamin Turner  wrote:
> I think that the multi threaded epoll changes that _just_ landed in master 
> will help resolve this, but they are so new I haven't been able to test this. 
>  I'll know more when I get a chance to test tomorrow.

Which multi-threaded epoll code just landed in master?  Are you thinking
of this one?

  http://review.gluster.org/#/c/3842/

If so, it's not in master yet. ;)

+ Justin


> -b
> 
> On Thu, Feb 5, 2015 at 6:04 PM, David F. Robinson 
>  wrote:
> Isn't rsync what geo-rep uses?
> 
> David  (Sent from mobile)
> 
> ===
> David F. Robinson, Ph.D.
> President - Corvid Technologies
> 704.799.6944 x101 [office]
> 704.252.1310  [cell]
> 704.799.7974  [fax]
> david.robin...@corvidtec.com
> http://www.corvidtechnologies.com
> 
> > On Feb 5, 2015, at 5:41 PM, Ben Turner  wrote:
> >
> > - Original Message -
> >> From: "Ben Turner" 
> >> To: "David F. Robinson" 
> >> Cc: "Pranith Kumar Karampuri" , "Xavier Hernandez" 
> >> , "Benjamin Turner"
> >> , gluster-us...@gluster.org, "Gluster Devel" 
> >> 
> >> Sent: Thursday, February 5, 2015 5:22:26 PM
> >> Subject: Re: [Gluster-users] [Gluster-devel] missing files
> >>
> >> - Original Message -
> >>> From: "David F. Robinson" 
> >>> To: "Ben Turner" 
> >>> Cc: "Pranith Kumar Karampuri" , "Xavier Hernandez"
> >>> , "Benjamin Turner"
> >>> , gluster-us...@gluster.org, "Gluster Devel"
> >>> 
> >>> Sent: Thursday, February 5, 2015 5:01:13 PM
> >>> Subject: Re: [Gluster-users] [Gluster-devel] missing files
> >>>
> >>> I'll send you the emails I sent Pranith with the logs. What causes these
> >>> disconnects?
> >>
> >> Thanks David!  Disconnects happen when there are interruptions in
> >> communication between peers; normally a ping timeout is what happens.
> >> It could be anything from a flaky NW to the system being too busy to
> >> respond to the pings.  My initial take leans more towards the latter, as
> >> rsync is absolutely the worst use case for gluster - IIRC it writes in
> >> 4kb blocks.  I try to keep my writes at least 64KB, as in my testing that
> >> is the smallest block size I can write with before perf starts to really
> >> drop off.  I'll try something similar in the lab.
> >
> > OK, I do think that the files being self-healed is the RCA for what you
> > were seeing.  Let's look at one of the disconnects:
> >
> > data-brick02a-homegfs.log:[2015-02-03 20:54:02.772180] I 
> > [server.c:518:server_rpc_notify] 0-homegfs-server: disconnecting connection 
> > from 
> > gfs01b.corvidtec.com-4175-2015/02/02-16:44:31:179119-homegfs-client-2-0-1
> >
> > And in the glustershd.log from the gfs01b_glustershd.log file:
> >
> > [2015-02-03 20:55:48.001797] I 
> > [afr-self-heal-entry.c:554:afr_selfheal_entry_do] 0-homegfs-replicate-0: 
> > performing entry selfheal on 6c79a368-edaa-432b-bef9-ec690ab42448
> > [2015-02-03 20:55:49.341996] I 
> > [afr-self-heal-common.c:476:afr_log_selfheal] 0-homegfs-replicate-0: 
> > Completed entry selfheal on 6c79a368-edaa-432b-bef9-ec690ab42448. source=1 
> > sinks=0
> > [2015-02-03 20:55:49.343093] I 
> > [afr-self-heal-entry.c:554:afr_selfheal_entry_do] 0-homegfs-replicate-0: 
> > performing entry selfheal on 792cb0d6-9290-4447-8cd7-2b2d7a116a69
> > [2015-02-03 20:55:50.463652] I 
> > [afr-self-heal-common.c:476:afr_log_selfheal] 0-homegfs-replicate-0: 
> > Completed entry selfheal on 792cb0d6-9290-4447-8cd7-2b2d7a116a69. source=1 
> > sinks=0
> > [2015-02-03 20:55:51.465289] I 
> > [afr-self-heal-metadata.c:54:__afr_selfheal_metadata_do] 
> > 0-homegfs-replicate-0: performing metadata selfheal on 
> > 403e661a-1c27-4e79-9867-c0572aba2b3c
> > [2015-02-03 20:55:51.466515] I 
> > [afr-self-heal-common.c:476:afr_log_selfheal] 0-homegfs-replicate-0: 
> > Completed metadata selfheal on 403e661a-1c27-4e79-9867-c0572aba2b3c. 
> > source=1 sinks=0
> > [2015-02-03 20:55:51.467098] I 
> > [afr-self-heal-entry.c:554:afr_selfheal_entry_do] 0-homegfs-replicate-0: 
> > performing entry selfheal on 403e661a-1c27-4e79-9867-c0572aba2b3c
> > [2015-02-03 20:55:55.257808] I 
> > [afr-self-heal-common.c:476:afr_log_selfheal] 0-homegfs-replicate-0: 
> > Completed entry selfheal on 403e661a-1c27-4e79-9867-c0572aba2b3c. source=1 
> > sinks=0
> > [2015-02-03 20:55:55.258548] I 
> > [afr-self-heal-metadata.c:54:__afr_selfheal_metadata_do] 
> > 0-homegfs-replicate-0: performing metadata selfheal on 
> > c612ee2f-2fb4-4157-a9ab-5a2d5603c541
> > [2015-02-03 20:55:55.259367] I 
> > [afr-self-heal-common.c:476:afr_log_selfheal] 0-homegfs-replicate-0: 
> > Completed metadata selfheal on c612ee2f-2fb4-4157-a9ab-5a2d5603c541. 
> > source=1 sinks=0
> > [2015-02-03 20:55:55.259980] I 
> > [afr-self-heal-entry.c:554:afr_selfheal_entry_do] 0-homegfs-replicate-0: 
> > performing entry selfheal on c612ee2f-2fb4-4157-a9ab-5a2d5603c541
> >
> > As you can see the self heal logs are just spammed with files being healed, 
> > and I looked at a couple of disconnects and I see

Re: [Gluster-devel] [Gluster-users] missing files

2015-02-05 Thread David F. Robinson

copy that.  Thanks for looking into the issue.

David


-- Original Message --
From: "Benjamin Turner" 
To: "David F. Robinson" 
Cc: "Ben Turner" ; "Pranith Kumar Karampuri" 
; "Xavier Hernandez" ; 
"gluster-us...@gluster.org" ; "Gluster Devel" 


Sent: 2/5/2015 9:05:43 PM
Subject: Re: [Gluster-users] [Gluster-devel] missing files

Correct!  I have seen (back in the day - it's been about 3 years since I have 
seen it) having, say, 50+ volumes each with a geo-rep session take system 
load levels to the point where pings couldn't be serviced within the 
ping timeout.  So it is known to happen, but there has been a lot of work 
in the geo-rep space to help here, some of which is discussed here:


https://medium.com/@msvbhat/distributed-geo-replication-in-glusterfs-ec95f4393c50

(think tar + ssh and other fixes).  Your symptoms remind me of that case 
of 50+ geo-repped volumes; that's why I mentioned it from the start.  My 
current shoot-from-the-hip theory is that when rsyncing all that data the 
servers got too busy to service the pings and it led to disconnects.  
This is common across all of the clustering / distributed software I 
have worked on: if the system gets too busy to service heartbeat within 
the timeout, things go crazy (think fork bomb on a single host).  Now 
this could be a case of me putting symptoms from an old issue onto what 
you are describing, but that's where my head is at.  If I'm correct I 
should be able to repro using a similar workload.  I think that the 
multi-threaded epoll changes that _just_ landed in master will help 
resolve this, but they are so new I haven't been able to test this.  
I'll know more when I get a chance to test tomorrow.


-b

On Thu, Feb 5, 2015 at 6:04 PM, David F. Robinson 
 wrote:

Isn't rsync what geo-rep uses?

David  (Sent from mobile)

===
David F. Robinson, Ph.D.
President - Corvid Technologies
704.799.6944 x101 [office]
704.252.1310  [cell]
704.799.7974  [fax]
david.robin...@corvidtec.com
http://www.corvidtechnologies.com

> On Feb 5, 2015, at 5:41 PM, Ben Turner  wrote:
>
> - Original Message -
>> From: "Ben Turner" 
>> To: "David F. Robinson" 
>> Cc: "Pranith Kumar Karampuri" , "Xavier 
Hernandez" , "Benjamin Turner"
>> , gluster-us...@gluster.org, "Gluster Devel" 


>> Sent: Thursday, February 5, 2015 5:22:26 PM
>> Subject: Re: [Gluster-users] [Gluster-devel] missing files
>>
>> - Original Message -
>>> From: "David F. Robinson" 
>>> To: "Ben Turner" 
>>> Cc: "Pranith Kumar Karampuri" , "Xavier 
Hernandez"

>>> , "Benjamin Turner"
>>> , gluster-us...@gluster.org, "Gluster Devel"
>>> 
>>> Sent: Thursday, February 5, 2015 5:01:13 PM
>>> Subject: Re: [Gluster-users] [Gluster-devel] missing files
>>>
>>> I'll send you the emails I sent Pranith with the logs. What causes 
these

>>> disconnects?
>>
>> Thanks David!  Disconnects happen when there are interruptions in
>> communication between peers; normally a ping timeout is what happens.
>> It could be anything from a flaky NW to the system being too busy to
>> respond to the pings.  My initial take leans more towards the latter, as
>> rsync is absolutely the worst use case for gluster - IIRC it writes in
>> 4kb blocks.  I try to keep my writes at least 64KB, as in my testing that
>> is the smallest block size I can write with before perf starts to really
>> drop off.  I'll try something similar in the lab.
>
> OK, I do think that the files being self-healed is the RCA for what you
> were seeing.  Let's look at one of the disconnects:

>
> data-brick02a-homegfs.log:[2015-02-03 20:54:02.772180] I 
[server.c:518:server_rpc_notify] 0-homegfs-server: disconnecting 
connection from 
gfs01b.corvidtec.com-4175-2015/02/02-16:44:31:179119-homegfs-client-2-0-1

>
> And in the glustershd.log from the gfs01b_glustershd.log file:
>
> [2015-02-03 20:55:48.001797] I 
[afr-self-heal-entry.c:554:afr_selfheal_entry_do] 
0-homegfs-replicate-0: performing entry selfheal on 
6c79a368-edaa-432b-bef9-ec690ab42448
> [2015-02-03 20:55:49.341996] I 
[afr-self-heal-common.c:476:afr_log_selfheal] 0-homegfs-replicate-0: 
Completed entry selfheal on 6c79a368-edaa-432b-bef9-ec690ab42448. 
source=1 sinks=0
> [2015-02-03 20:55:49.343093] I 
[afr-self-heal-entry.c:554:afr_selfheal_entry_do] 
0-homegfs-replicate-0: performing entry selfheal on 
792cb0d6-9290-4447-8cd7-2b2d7a116a69
> [2015-02-03 20:55:50.463652] I 
[afr-self-heal-common.c:476:afr_log_selfheal] 0-homegfs-replicate-0: 
Completed entry selfheal on 792cb0d6-9290-4447-8cd7-2b2d7a116a69. 
source=1 sinks=0
> [2015-02-03 20:55:51.465289] I 
[afr-self-heal-metadata.c:54:__afr_selfheal_metadata_do] 
0-homegfs-replicate-0: performing metadata selfheal on 
403e661a-1c27-4e79-9867-c0572aba2b3c
> [2015-02-03 20:55:51.466515] I 
[afr-self-heal-common.c:476:afr_log_selfheal] 0-homegfs-replicate-0: 
Completed metadata selfheal on 403e661a-1c27-4e79-9867-c0572aba2b3c. 
source=1 sinks=0
> [2015-02-03 20:55:51.467098] I 
[afr-self-heal-entry.c:554:afr_selfheal_

Re: [Gluster-devel] [Gluster-users] missing files

2015-02-05 Thread Benjamin Turner
Correct!  I have seen (back in the day - it's been about 3 years since I have
seen it) having, say, 50+ volumes each with a geo-rep session take system
load levels to the point where pings couldn't be serviced within the ping
timeout.  So it is known to happen, but there has been a lot of work in the
geo-rep space to help here, some of which is discussed here:

https://medium.com/@msvbhat/distributed-geo-replication-in-glusterfs-ec95f4393c50

(think tar + ssh and other fixes).  Your symptoms remind me of that case of
50+ geo-repped volumes; that's why I mentioned it from the start.  My current
shoot-from-the-hip theory is that when rsyncing all that data the servers got
too busy to service the pings and it led to disconnects.  This is common
across all of the clustering / distributed software I have worked on: if the
system gets too busy to service heartbeat within the timeout, things go
crazy (think fork bomb on a single host).  Now this could be a case of me
putting symptoms from an old issue onto what you are describing, but that's
where my head is at.  If I'm correct I should be able to repro using a
similar workload.  I think that the multi-threaded epoll changes that _just_
landed in master will help resolve this, but they are so new I haven't been
able to test this.  I'll know more when I get a chance to test tomorrow.
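
If load really is the trigger, one knob worth looking at while we chase the
root cause is the ping timeout - your volume info shows it set to 10s, and the
default is 42s:

    gluster volume info homegfs | grep ping-timeout   # currently 10
    gluster volume set homegfs network.ping-timeout 42

That won't fix whatever is making the servers too busy; it just makes the
clients more tolerant of slow responses before they drop the connection.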

-b

On Thu, Feb 5, 2015 at 6:04 PM, David F. Robinson <
david.robin...@corvidtec.com> wrote:

> Isn't rsync what geo-rep uses?
>
> David  (Sent from mobile)
>
> ===
> David F. Robinson, Ph.D.
> President - Corvid Technologies
> 704.799.6944 x101 [office]
> 704.252.1310  [cell]
> 704.799.7974  [fax]
> david.robin...@corvidtec.com
> http://www.corvidtechnologies.com
>
> > On Feb 5, 2015, at 5:41 PM, Ben Turner  wrote:
> >
> > - Original Message -
> >> From: "Ben Turner" 
> >> To: "David F. Robinson" 
> >> Cc: "Pranith Kumar Karampuri" , "Xavier
> Hernandez" , "Benjamin Turner"
> >> , gluster-us...@gluster.org, "Gluster Devel" <
> gluster-devel@gluster.org>
> >> Sent: Thursday, February 5, 2015 5:22:26 PM
> >> Subject: Re: [Gluster-users] [Gluster-devel] missing files
> >>
> >> - Original Message -
> >>> From: "David F. Robinson" 
> >>> To: "Ben Turner" 
> >>> Cc: "Pranith Kumar Karampuri" , "Xavier
> Hernandez"
> >>> , "Benjamin Turner"
> >>> , gluster-us...@gluster.org, "Gluster Devel"
> >>> 
> >>> Sent: Thursday, February 5, 2015 5:01:13 PM
> >>> Subject: Re: [Gluster-users] [Gluster-devel] missing files
> >>>
> >>> I'll send you the emails I sent Pranith with the logs. What causes
> these
> >>> disconnects?
> >>
> >> Thanks David!  Disconnects happen when there are interruptions in
> >> communication between peers; normally a ping timeout is what happens.
> >> It could be anything from a flaky NW to the system being too busy to
> >> respond to the pings.  My initial take leans more towards the latter, as
> >> rsync is absolutely the worst use case for gluster - IIRC it writes in
> >> 4kb blocks.  I try to keep my writes at least 64KB, as in my testing
> >> that is the smallest block size I can write with before perf starts to
> >> really drop off.  I'll try something similar in the lab.
> >
> > OK, I do think that the files being self-healed is the RCA for what you
> > were seeing.  Let's look at one of the disconnects:
> >
> > data-brick02a-homegfs.log:[2015-02-03 20:54:02.772180] I
> [server.c:518:server_rpc_notify] 0-homegfs-server: disconnecting connection
> from
> gfs01b.corvidtec.com-4175-2015/02/02-16:44:31:179119-homegfs-client-2-0-1
> >
> > And in the glustershd.log from the gfs01b_glustershd.log file:
> >
> > [2015-02-03 20:55:48.001797] I
> [afr-self-heal-entry.c:554:afr_selfheal_entry_do] 0-homegfs-replicate-0:
> performing entry selfheal on 6c79a368-edaa-432b-bef9-ec690ab42448
> > [2015-02-03 20:55:49.341996] I
> [afr-self-heal-common.c:476:afr_log_selfheal] 0-homegfs-replicate-0:
> Completed entry selfheal on 6c79a368-edaa-432b-bef9-ec690ab42448. source=1
> sinks=0
> > [2015-02-03 20:55:49.343093] I
> [afr-self-heal-entry.c:554:afr_selfheal_entry_do] 0-homegfs-replicate-0:
> performing entry selfheal on 792cb0d6-9290-4447-8cd7-2b2d7a116a69
> > [2015-02-03 20:55:50.463652] I
> [afr-self-heal-common.c:476:afr_log_selfheal] 0-homegfs-replicate-0:
> Completed entry selfheal on 792cb0d6-9290-4447-8cd7-2b2d7a116a69. source=1
> sinks=0
> > [2015-02-03 20:55:51.465289] I
> [afr-self-heal-metadata.c:54:__afr_selfheal_metadata_do]
> 0-homegfs-replicate-0: performing metadata selfheal on
> 403e661a-1c27-4e79-9867-c0572aba2b3c
> > [2015-02-03 20:55:51.466515] I
> [afr-self-heal-common.c:476:afr_log_selfheal] 0-homegfs-replicate-0:
> Completed metadata selfheal on 403e661a-1c27-4e79-9867-c0572aba2b3c.
> source=1 sinks=0
> > [2015-02-03 20:55:51.467098] I
> [afr-self-heal-entry.c:554:afr_selfheal_entry_do] 0-homegfs-replicate-0:
> performing entry selfheal on 403e661a-1c27-4e79-9867-c0572aba2b3c
> > [2015-02-03 20:55:55.257808] I
> [afr-self-heal-common.c:476:afr_

Re: [Gluster-devel] [Gluster-users] missing files

2015-02-05 Thread Pranith Kumar Karampuri


- Original Message -
> From: "Ben Turner" 
> To: "Pranith Kumar Karampuri" , "David F. Robinson" 
> 
> Cc: "Xavier Hernandez" , "Benjamin Turner" 
> , gluster-us...@gluster.org,
> "Gluster Devel" 
> Sent: Friday, February 6, 2015 3:25:28 AM
> Subject: Re: [Gluster-users] [Gluster-devel] missing files
> 
> - Original Message -
> > From: "Pranith Kumar Karampuri" 
> > To: "Xavier Hernandez" , "David F. Robinson"
> > , "Benjamin Turner"
> > 
> > Cc: gluster-us...@gluster.org, "Gluster Devel" 
> > Sent: Thursday, February 5, 2015 5:30:04 AM
> > Subject: Re: [Gluster-users] [Gluster-devel] missing files
> > 
> > 
> > On 02/05/2015 03:48 PM, Pranith Kumar Karampuri wrote:
> > > I believe David already fixed this. I hope this is the same issue he
> > > told about permissions issue.
> > Oops, it is not. I will take a look.
> 
> Yes David exactly like these:
> 
> data-brick02a-homegfs.log:[2015-02-03 19:09:34.568842] I
> [server.c:518:server_rpc_notify] 0-homegfs-server: disconnecting connection
> from
> gfs02a.corvidtec.com-18563-2015/02/03-19:07:58:519134-homegfs-client-2-0-0
> data-brick02a-homegfs.log:[2015-02-03 19:09:41.286551] I
> [server.c:518:server_rpc_notify] 0-homegfs-server: disconnecting connection
> from
> gfs01a.corvidtec.com-12804-2015/02/03-19:09:38:497808-homegfs-client-2-0-0
> data-brick02a-homegfs.log:[2015-02-03 19:16:35.906412] I
> [server.c:518:server_rpc_notify] 0-homegfs-server: disconnecting connection
> from
> gfs02b.corvidtec.com-27190-2015/02/03-19:15:53:458467-homegfs-client-2-0-0
> data-brick02a-homegfs.log:[2015-02-03 19:51:22.761293] I
> [server.c:518:server_rpc_notify] 0-homegfs-server: disconnecting connection
> from
> gfs01a.corvidtec.com-25926-2015/02/03-19:51:02:89070-homegfs-client-2-0-0
> data-brick02a-homegfs.log:[2015-02-03 20:54:02.772180] I
> [server.c:518:server_rpc_notify] 0-homegfs-server: disconnecting connection
> from
> gfs01b.corvidtec.com-4175-2015/02/02-16:44:31:179119-homegfs-client-2-0-1
> 
> You can 100% verify my theory if you can correlate the time of the
> disconnects with the time that the missing files were healed.  Can you have a
> look at /var/log/glusterfs/glustershd.log?  That has all of the healed files
> + timestamps; if we can see a disconnect during the rsync and a self-heal of
> the missing file, I think we can safely assume that the disconnects may have
> caused this.  I'll try this on my test systems - how much data did you rsync?
> What rough file sizes / an idea of the dir layout?
> 
> @Pranith - Could this be the cause here: bricks flapping up and down during
> the rsync left the files missing on the first ls (written to one subvol but
> not the other because it was down), the ls triggered SH, and that's why the
> files were there for the second ls?

No, it would be a bug.  AFR should serve the directory contents from the brick 
that has those files.
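
A quick way to see the expected behaviour on a test volume (names and paths
here are only examples):

    gluster volume status testvol            # note the PID of one brick
    kill <that-brick-pid>                    # take the brick down
    touch /mnt/testvol/afr-check-{1..10}     # create entries via the mount
    ls /mnt/testvol | grep afr-check         # must still list all 10 entries
    gluster volume start testvol force       # restart the downed brick
    gluster volume heal testvol info         # entries pending heal to it

If the entries do not show up in the ls while the brick is down, that is the
bug.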

> 
> -b
> 
>  
> > Pranith
> > >
> > > Pranith
> > > On 02/05/2015 03:44 PM, Xavier Hernandez wrote:
> > >> Is the failure repeatable ? with the same directories ?
> > >>
> > >> It's very weird that the directories appear on the volume when you do
> > >> an 'ls' on the bricks. Could it be that you only did a single 'ls'
> > >> on the fuse mount, which did not show the directory? Is it possible that
> > >> this 'ls' triggered a self-heal that repaired the problem, whatever
> > >> it was, and when you did another 'ls' on the fuse mount after the
> > >> 'ls' on the bricks, the directories were there?
> > >>
> > >> The first 'ls' could have healed the files, causing the
> > >> following 'ls' on the bricks to show the files as if nothing were
> > >> damaged. If that's the case, it's possible that there were some
> > >> disconnections during the copy.
> > >>
> > >> Added Pranith because he knows better replication and self-heal details.
> > >>
> > >> Xavi
> > >>
> > >> On 02/04/2015 07:23 PM, David F. Robinson wrote:
> > >>> Distributed/replicated
> > >>>
> > >>> Volume Name: homegfs
> > >>> Type: Distributed-Replicate
> > >>> Volume ID: 1e32672a-f1b7-4b58-ba94-58c085e59071
> > >>> Status: Started
> > >>> Number of Bricks: 4 x 2 = 8
> > >>> Transport-type: tcp
> > >>> Bricks:
> > >>> Brick1: gfsib01a.corvidtec.com:/data/brick01a/homegfs
> > >>> Brick2: gfsib01b.corvidtec.com:/data/brick01b/homegfs
> > >>> Brick3: gfsib01a.corvidtec.com:/data/brick02a/homegfs
> > >>> Brick4: gfsib01b.corvidtec.com:/data/brick02b/homegfs
> > >>> Brick5: gfsib02a.corvidtec.com:/data/brick01a/homegfs
> > >>> Brick6: gfsib02b.corvidtec.com:/data/brick01b/homegfs
> > >>> Brick7: gfsib02a.corvidtec.com:/data/brick02a/homegfs
> > >>> Brick8: gfsib02b.corvidtec.com:/data/brick02b/homegfs
> > >>> Options Reconfigured:
> > >>> performance.io-thread-count: 32
> > >>> performance.cache-size: 128MB
> > >>> performance.write-behind-window-size: 128MB
> > >>> server.allow-insecure: on
> > >>> network.ping-timeout: 10
> > >>> storage.owner-gid: 100
> > >>> geo-replication.indexing: off
> > >>> geo-replication.ignore-pid-c

Re: [Gluster-devel] [Gluster-users] missing files

2015-02-05 Thread David F. Robinson
Isn't rsync what geo-rep uses?

David  (Sent from mobile)

===
David F. Robinson, Ph.D. 
President - Corvid Technologies
704.799.6944 x101 [office]
704.252.1310  [cell]
704.799.7974  [fax]
david.robin...@corvidtec.com
http://www.corvidtechnologies.com

> On Feb 5, 2015, at 5:41 PM, Ben Turner  wrote:
> 
> - Original Message -
>> From: "Ben Turner" 
>> To: "David F. Robinson" 
>> Cc: "Pranith Kumar Karampuri" , "Xavier Hernandez" 
>> , "Benjamin Turner"
>> , gluster-us...@gluster.org, "Gluster Devel" 
>> 
>> Sent: Thursday, February 5, 2015 5:22:26 PM
>> Subject: Re: [Gluster-users] [Gluster-devel] missing files
>> 
>> - Original Message -
>>> From: "David F. Robinson" 
>>> To: "Ben Turner" 
>>> Cc: "Pranith Kumar Karampuri" , "Xavier Hernandez"
>>> , "Benjamin Turner"
>>> , gluster-us...@gluster.org, "Gluster Devel"
>>> 
>>> Sent: Thursday, February 5, 2015 5:01:13 PM
>>> Subject: Re: [Gluster-users] [Gluster-devel] missing files
>>> 
>>> I'll send you the emails I sent Pranith with the logs. What causes these
>>> disconnects?
>> 
>> Thanks David!  Disconnects happen when there are interruptions in
>> communication between peers; normally a ping timeout is what happens.
>> It could be anything from a flaky NW to the system being too busy to respond
>> to the pings.  My initial take leans more towards the latter, as rsync is
>> absolutely the worst use case for gluster - IIRC it writes in 4kb blocks.  I
>> try to keep my writes at least 64KB, as in my testing that is the smallest
>> block size I can write with before perf starts to really drop off.  I'll try
>> something similar in the lab.
> 
> OK, I do think that the files being self-healed is the RCA for what you were 
> seeing.  Let's look at one of the disconnects:
> 
> data-brick02a-homegfs.log:[2015-02-03 20:54:02.772180] I 
> [server.c:518:server_rpc_notify] 0-homegfs-server: disconnecting connection 
> from gfs01b.corvidtec.com-4175-2015/02/02-16:44:31:179119-homegfs-client-2-0-1
> 
> And in the glustershd.log from the gfs01b_glustershd.log file:
> 
> [2015-02-03 20:55:48.001797] I 
> [afr-self-heal-entry.c:554:afr_selfheal_entry_do] 0-homegfs-replicate-0: 
> performing entry selfheal on 6c79a368-edaa-432b-bef9-ec690ab42448
> [2015-02-03 20:55:49.341996] I [afr-self-heal-common.c:476:afr_log_selfheal] 
> 0-homegfs-replicate-0: Completed entry selfheal on 
> 6c79a368-edaa-432b-bef9-ec690ab42448. source=1 sinks=0 
> [2015-02-03 20:55:49.343093] I 
> [afr-self-heal-entry.c:554:afr_selfheal_entry_do] 0-homegfs-replicate-0: 
> performing entry selfheal on 792cb0d6-9290-4447-8cd7-2b2d7a116a69
> [2015-02-03 20:55:50.463652] I [afr-self-heal-common.c:476:afr_log_selfheal] 
> 0-homegfs-replicate-0: Completed entry selfheal on 
> 792cb0d6-9290-4447-8cd7-2b2d7a116a69. source=1 sinks=0 
> [2015-02-03 20:55:51.465289] I 
> [afr-self-heal-metadata.c:54:__afr_selfheal_metadata_do] 
> 0-homegfs-replicate-0: performing metadata selfheal on 
> 403e661a-1c27-4e79-9867-c0572aba2b3c
> [2015-02-03 20:55:51.466515] I [afr-self-heal-common.c:476:afr_log_selfheal] 
> 0-homegfs-replicate-0: Completed metadata selfheal on 
> 403e661a-1c27-4e79-9867-c0572aba2b3c. source=1 sinks=0 
> [2015-02-03 20:55:51.467098] I 
> [afr-self-heal-entry.c:554:afr_selfheal_entry_do] 0-homegfs-replicate-0: 
> performing entry selfheal on 403e661a-1c27-4e79-9867-c0572aba2b3c
> [2015-02-03 20:55:55.257808] I [afr-self-heal-common.c:476:afr_log_selfheal] 
> 0-homegfs-replicate-0: Completed entry selfheal on 
> 403e661a-1c27-4e79-9867-c0572aba2b3c. source=1 sinks=0 
> [2015-02-03 20:55:55.258548] I 
> [afr-self-heal-metadata.c:54:__afr_selfheal_metadata_do] 
> 0-homegfs-replicate-0: performing metadata selfheal on 
> c612ee2f-2fb4-4157-a9ab-5a2d5603c541
> [2015-02-03 20:55:55.259367] I [afr-self-heal-common.c:476:afr_log_selfheal] 
> 0-homegfs-replicate-0: Completed metadata selfheal on 
> c612ee2f-2fb4-4157-a9ab-5a2d5603c541. source=1 sinks=0 
> [2015-02-03 20:55:55.259980] I 
> [afr-self-heal-entry.c:554:afr_selfheal_entry_do] 0-homegfs-replicate-0: 
> performing entry selfheal on c612ee2f-2fb4-4157-a9ab-5a2d5603c541
> 
> As you can see, the self-heal logs are just spammed with files being healed, 
> and for the couple of disconnects I looked at I see self-heals getting run 
> shortly after on the bricks that were down.  Now we need to find the cause of 
> the disconnects; I am thinking that once the disconnects are resolved the files 
> should be properly copied over without SH having to fix things.  Like I said, 
> I'll give this a go on my lab systems and see if I can repro the disconnects; 
> I'll have time to run through it tomorrow.  If in the meantime anyone else 
> has a theory / anything to add here, it would be appreciated.
> 
> -b
> 
>> -b
>> 
>>> David  (Sent from mobile)
>>> 
>>> ===
>>> David F. Robinson, Ph.D.
>>> President - Corvid Technologies
>>> 704.799.6944 x101 [office]
>>> 704.252.1310  [cell]
>>> 704.799.7974   

Re: [Gluster-devel] [Gluster-users] missing files

2015-02-05 Thread Joe Julian

Out of curiosity, are you using --inplace?
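
If not, something like this is worth a try on the next run (untested here;
--whole-file is just to take the delta algorithm, and its small checksum
blocks, out of the picture):

    rsync -aHAX --inplace --whole-file --progress /source/dir/ /mnt/homegfs/dest/

--inplace writes directly into the destination file instead of the usual
write-to-temp-then-rename, which changes the I/O pattern gluster sees.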

On 02/05/2015 02:59 PM, David F. Robinson wrote:

Should I run my rsync with --block-size set to something other than the default? Is 
there an optimal value? I think 128k is the max from my quick search. Didn't 
dig into it thoroughly though.

David  (Sent from mobile)

===
David F. Robinson, Ph.D.
President - Corvid Technologies
704.799.6944 x101 [office]
704.252.1310  [cell]
704.799.7974  [fax]
david.robin...@corvidtec.com
http://www.corvidtechnologies.com


On Feb 5, 2015, at 5:41 PM, Ben Turner  wrote:

- Original Message -

From: "Ben Turner" 
To: "David F. Robinson" 
Cc: "Pranith Kumar Karampuri" , "Xavier Hernandez" 
, "Benjamin Turner"
, gluster-us...@gluster.org, "Gluster Devel" 

Sent: Thursday, February 5, 2015 5:22:26 PM
Subject: Re: [Gluster-users] [Gluster-devel] missing files

- Original Message -

From: "David F. Robinson" 
To: "Ben Turner" 
Cc: "Pranith Kumar Karampuri" , "Xavier Hernandez"
, "Benjamin Turner"
, gluster-us...@gluster.org, "Gluster Devel"

Sent: Thursday, February 5, 2015 5:01:13 PM
Subject: Re: [Gluster-users] [Gluster-devel] missing files

I'll send you the emails I sent Pranith with the logs. What causes these
disconnects?

Thanks David!  Disconnects happen when there are interruptions in
communication between peers; normally a ping timeout is what happens.
It could be anything from a flaky NW to the system being too busy to respond
to the pings.  My initial take leans more towards the latter, as rsync is
absolutely the worst use case for gluster - IIRC it writes in 4kb blocks.  I
try to keep my writes at least 64KB, as in my testing that is the smallest
block size I can write with before perf starts to really drop off.  I'll try
something similar in the lab.

OK, I do think that the files being self-healed is the RCA for what you were seeing.  
Let's look at one of the disconnects:

data-brick02a-homegfs.log:[2015-02-03 20:54:02.772180] I 
[server.c:518:server_rpc_notify] 0-homegfs-server: disconnecting connection 
from gfs01b.corvidtec.com-4175-2015/02/02-16:44:31:179119-homegfs-client-2-0-1

And in the glustershd.log from the gfs01b_glustershd.log file:

[2015-02-03 20:55:48.001797] I 
[afr-self-heal-entry.c:554:afr_selfheal_entry_do] 0-homegfs-replicate-0: 
performing entry selfheal on 6c79a368-edaa-432b-bef9-ec690ab42448
[2015-02-03 20:55:49.341996] I [afr-self-heal-common.c:476:afr_log_selfheal] 
0-homegfs-replicate-0: Completed entry selfheal on 
6c79a368-edaa-432b-bef9-ec690ab42448. source=1 sinks=0
[2015-02-03 20:55:49.343093] I 
[afr-self-heal-entry.c:554:afr_selfheal_entry_do] 0-homegfs-replicate-0: 
performing entry selfheal on 792cb0d6-9290-4447-8cd7-2b2d7a116a69
[2015-02-03 20:55:50.463652] I [afr-self-heal-common.c:476:afr_log_selfheal] 
0-homegfs-replicate-0: Completed entry selfheal on 
792cb0d6-9290-4447-8cd7-2b2d7a116a69. source=1 sinks=0
[2015-02-03 20:55:51.465289] I 
[afr-self-heal-metadata.c:54:__afr_selfheal_metadata_do] 0-homegfs-replicate-0: 
performing metadata selfheal on 403e661a-1c27-4e79-9867-c0572aba2b3c
[2015-02-03 20:55:51.466515] I [afr-self-heal-common.c:476:afr_log_selfheal] 
0-homegfs-replicate-0: Completed metadata selfheal on 
403e661a-1c27-4e79-9867-c0572aba2b3c. source=1 sinks=0
[2015-02-03 20:55:51.467098] I 
[afr-self-heal-entry.c:554:afr_selfheal_entry_do] 0-homegfs-replicate-0: 
performing entry selfheal on 403e661a-1c27-4e79-9867-c0572aba2b3c
[2015-02-03 20:55:55.257808] I [afr-self-heal-common.c:476:afr_log_selfheal] 
0-homegfs-replicate-0: Completed entry selfheal on 
403e661a-1c27-4e79-9867-c0572aba2b3c. source=1 sinks=0
[2015-02-03 20:55:55.258548] I 
[afr-self-heal-metadata.c:54:__afr_selfheal_metadata_do] 0-homegfs-replicate-0: 
performing metadata selfheal on c612ee2f-2fb4-4157-a9ab-5a2d5603c541
[2015-02-03 20:55:55.259367] I [afr-self-heal-common.c:476:afr_log_selfheal] 
0-homegfs-replicate-0: Completed metadata selfheal on 
c612ee2f-2fb4-4157-a9ab-5a2d5603c541. source=1 sinks=0
[2015-02-03 20:55:55.259980] I 
[afr-self-heal-entry.c:554:afr_selfheal_entry_do] 0-homegfs-replicate-0: 
performing entry selfheal on c612ee2f-2fb4-4157-a9ab-5a2d5603c541

As you can see, the self-heal logs are just spammed with files being healed, and 
for the couple of disconnects I looked at I see self-heals getting run shortly 
after on the bricks that were down.  Now we need to find the cause of the 
disconnects; I am thinking that once the disconnects are resolved the files should 
be properly copied over without SH having to fix things.  Like I said, I'll give 
this a go on my lab systems and see if I can repro the disconnects; I'll have 
time to run through it tomorrow.  If in the meantime anyone else has a theory 
/ anything to add here, it would be appreciated.

-b


-b


David  (Sent from mobile)

===
David F. Robinson, Ph.D.
President - Corvid Technologies
704.799.6944 x101 [office]
704.252.1310  [cell]
704.799.7974  [fax]
d

Re: [Gluster-devel] [Gluster-users] missing files

2015-02-05 Thread David F. Robinson
Should I run my rsync with --block-size set to something other than the default? Is
there an optimal value? I think 128k is the max from my quick search; I didn't
dig into it thoroughly though.
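
A sketch of the kind of invocation being discussed (the paths are placeholders,
and note that rsync's --block-size only sets the checksum block size for its
delta-transfer algorithm, so it may not change the write sizes gluster actually
sees):

# hypothetical run with the 128KB maximum block size
rsync -av --block-size=131072 /local/source/ /mnt/homegfs/dest/

# alternatively, skip the delta algorithm entirely; this is often kinder to network filesystems
rsync -av --whole-file /local/source/ /mnt/homegfs/dest/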

David  (Sent from mobile)

===
David F. Robinson, Ph.D. 
President - Corvid Technologies
704.799.6944 x101 [office]
704.252.1310  [cell]
704.799.7974  [fax]
david.robin...@corvidtec.com
http://www.corvidtechnologies.com

> On Feb 5, 2015, at 5:41 PM, Ben Turner  wrote:
> 
> - Original Message -
>> From: "Ben Turner" 
>> To: "David F. Robinson" 
>> Cc: "Pranith Kumar Karampuri" , "Xavier Hernandez" 
>> , "Benjamin Turner"
>> , gluster-us...@gluster.org, "Gluster Devel" 
>> 
>> Sent: Thursday, February 5, 2015 5:22:26 PM
>> Subject: Re: [Gluster-users] [Gluster-devel] missing files
>> 
>> - Original Message -
>>> From: "David F. Robinson" 
>>> To: "Ben Turner" 
>>> Cc: "Pranith Kumar Karampuri" , "Xavier Hernandez"
>>> , "Benjamin Turner"
>>> , gluster-us...@gluster.org, "Gluster Devel"
>>> 
>>> Sent: Thursday, February 5, 2015 5:01:13 PM
>>> Subject: Re: [Gluster-users] [Gluster-devel] missing files
>>> 
>>> I'll send you the emails I sent Pranith with the logs. What causes these
>>> disconnects?
>> 
>> Thanks David!  Disconnects happen when there are interruptions in
>> communication between peers; normally it's a ping timeout. It could be
>> anything from a flaky NW to the system being too busy to respond to the
>> pings.  My initial take is more towards the latter, as rsync is
>> absolutely the worst use case for gluster - IIRC it writes in 4kb blocks.  I
>> try to keep my writes to at least 64KB, as in my testing that is the smallest
>> block size I can write before perf really starts to drop off.  I'll try
>> something similar in the lab.
> 
> OK, I do think the self-healing of these files is the RCA for what you were
> seeing.  Let's look at one of the disconnects:
> 
> data-brick02a-homegfs.log:[2015-02-03 20:54:02.772180] I 
> [server.c:518:server_rpc_notify] 0-homegfs-server: disconnecting connection 
> from gfs01b.corvidtec.com-4175-2015/02/02-16:44:31:179119-homegfs-client-2-0-1
> 
> And in the glustershd.log from the gfs01b_glustershd.log file:
> 
> [2015-02-03 20:55:48.001797] I 
> [afr-self-heal-entry.c:554:afr_selfheal_entry_do] 0-homegfs-replicate-0: 
> performing entry selfheal on 6c79a368-edaa-432b-bef9-ec690ab42448
> [2015-02-03 20:55:49.341996] I [afr-self-heal-common.c:476:afr_log_selfheal] 
> 0-homegfs-replicate-0: Completed entry selfheal on 
> 6c79a368-edaa-432b-bef9-ec690ab42448. source=1 sinks=0 
> [2015-02-03 20:55:49.343093] I 
> [afr-self-heal-entry.c:554:afr_selfheal_entry_do] 0-homegfs-replicate-0: 
> performing entry selfheal on 792cb0d6-9290-4447-8cd7-2b2d7a116a69
> [2015-02-03 20:55:50.463652] I [afr-self-heal-common.c:476:afr_log_selfheal] 
> 0-homegfs-replicate-0: Completed entry selfheal on 
> 792cb0d6-9290-4447-8cd7-2b2d7a116a69. source=1 sinks=0 
> [2015-02-03 20:55:51.465289] I 
> [afr-self-heal-metadata.c:54:__afr_selfheal_metadata_do] 
> 0-homegfs-replicate-0: performing metadata selfheal on 
> 403e661a-1c27-4e79-9867-c0572aba2b3c
> [2015-02-03 20:55:51.466515] I [afr-self-heal-common.c:476:afr_log_selfheal] 
> 0-homegfs-replicate-0: Completed metadata selfheal on 
> 403e661a-1c27-4e79-9867-c0572aba2b3c. source=1 sinks=0 
> [2015-02-03 20:55:51.467098] I 
> [afr-self-heal-entry.c:554:afr_selfheal_entry_do] 0-homegfs-replicate-0: 
> performing entry selfheal on 403e661a-1c27-4e79-9867-c0572aba2b3c
> [2015-02-03 20:55:55.257808] I [afr-self-heal-common.c:476:afr_log_selfheal] 
> 0-homegfs-replicate-0: Completed entry selfheal on 
> 403e661a-1c27-4e79-9867-c0572aba2b3c. source=1 sinks=0 
> [2015-02-03 20:55:55.258548] I 
> [afr-self-heal-metadata.c:54:__afr_selfheal_metadata_do] 
> 0-homegfs-replicate-0: performing metadata selfheal on 
> c612ee2f-2fb4-4157-a9ab-5a2d5603c541
> [2015-02-03 20:55:55.259367] I [afr-self-heal-common.c:476:afr_log_selfheal] 
> 0-homegfs-replicate-0: Completed metadata selfheal on 
> c612ee2f-2fb4-4157-a9ab-5a2d5603c541. source=1 sinks=0 
> [2015-02-03 20:55:55.259980] I 
> [afr-self-heal-entry.c:554:afr_selfheal_entry_do] 0-homegfs-replicate-0: 
> performing entry selfheal on c612ee2f-2fb4-4157-a9ab-5a2d5603c541
> 
> As you can see, the self-heal logs are spammed with files being healed, and when 
> I looked at a couple of the disconnects I saw self-heals getting run shortly 
> afterwards on the bricks that went down.  Now we need to find the cause of the 
> disconnects; I am thinking that once the disconnects are resolved, the files 
> should be properly copied over without SH having to fix things.  Like I said, 
> I'll give this a go on my lab systems and see if I can repro the disconnects; 
> I'll have time to run through it tomorrow.  If in the meantime anyone else 
> has a theory or anything to add here, it would be appreciated.
> 
> -b
> 
>> -b
>> 
>>> David  (Sent from mobile)
>>> 
>>> ==

Re: [Gluster-devel] [Gluster-users] missing files

2015-02-05 Thread David F. Robinson
It was a mix of files from very small to very large, and many terabytes of
data, approximately 20 TB.

David  (Sent from mobile)

===
David F. Robinson, Ph.D. 
President - Corvid Technologies
704.799.6944 x101 [office]
704.252.1310  [cell]
704.799.7974  [fax]
david.robin...@corvidtec.com
http://www.corvidtechnologies.com

> On Feb 5, 2015, at 4:55 PM, Ben Turner  wrote:
> 
> - Original Message -
>> From: "Pranith Kumar Karampuri" 
>> To: "Xavier Hernandez" , "David F. Robinson" 
>> , "Benjamin Turner"
>> 
>> Cc: gluster-us...@gluster.org, "Gluster Devel" 
>> Sent: Thursday, February 5, 2015 5:30:04 AM
>> Subject: Re: [Gluster-users] [Gluster-devel] missing files
>> 
>> 
>>> On 02/05/2015 03:48 PM, Pranith Kumar Karampuri wrote:
>>> I believe David already fixed this. I hope this is the same permissions
>>> issue he told us about.
>> Oops, it is not. I will take a look.
> 
> Yes David exactly like these:
> 
> data-brick02a-homegfs.log:[2015-02-03 19:09:34.568842] I 
> [server.c:518:server_rpc_notify] 0-homegfs-server: disconnecting connection 
> from 
> gfs02a.corvidtec.com-18563-2015/02/03-19:07:58:519134-homegfs-client-2-0-0
> data-brick02a-homegfs.log:[2015-02-03 19:09:41.286551] I 
> [server.c:518:server_rpc_notify] 0-homegfs-server: disconnecting connection 
> from 
> gfs01a.corvidtec.com-12804-2015/02/03-19:09:38:497808-homegfs-client-2-0-0
> data-brick02a-homegfs.log:[2015-02-03 19:16:35.906412] I 
> [server.c:518:server_rpc_notify] 0-homegfs-server: disconnecting connection 
> from 
> gfs02b.corvidtec.com-27190-2015/02/03-19:15:53:458467-homegfs-client-2-0-0
> data-brick02a-homegfs.log:[2015-02-03 19:51:22.761293] I 
> [server.c:518:server_rpc_notify] 0-homegfs-server: disconnecting connection 
> from gfs01a.corvidtec.com-25926-2015/02/03-19:51:02:89070-homegfs-client-2-0-0
> data-brick02a-homegfs.log:[2015-02-03 20:54:02.772180] I 
> [server.c:518:server_rpc_notify] 0-homegfs-server: disconnecting connection 
> from gfs01b.corvidtec.com-4175-2015/02/02-16:44:31:179119-homegfs-client-2-0-1
> 
> You can 100% verify my theory if you can correlate the time of the 
> disconnects to the time that the missing files were healed.  Can you have a 
> look at /var/log/glusterfs/glustershd.log?  That has all of the healed files 
> + timestamps; if we can see a disconnect during the rsync and a self-heal of 
> the missing file, I think we can safely assume that the disconnects may have 
> caused this.  I'll try this on my test systems.  How much data did you rsync?  
> Roughly what size of files, and an idea of the dir layout?  
> 
> @Pranith - Could bricks flapping up and down during the rsync have caused the files 
> to be missing on the first ls (written to one subvol but not the other because it 
> was down), with that ls triggering SH, which is why the files were there for the 
> second ls?  Could that be a possible cause here?
> 
> -b
> 
> 
>> Pranith
>>> 
>>> Pranith
 On 02/05/2015 03:44 PM, Xavier Hernandez wrote:
 Is the failure repeatable ? with the same directories ?
 
 It's very weird that the directories appear on the volume when you do
 an 'ls' on the bricks. Could it be that you only made a single 'ls'
 on fuse mount which not showed the directory ? Is it possible that
 this 'ls' triggered a self-heal that repaired the problem, whatever
 it was, and when you did another 'ls' on the fuse mount after the
 'ls' on the bricks, the directories were there ?
 
 The first 'ls' could have healed the files, causing that the
 following 'ls' on the bricks showed the files as if nothing were
 damaged. If that's the case, it's possible that there were some
 disconnections during the copy.
 
 Added Pranith because he knows better replication and self-heal details.
 
 Xavi
 
> On 02/04/2015 07:23 PM, David F. Robinson wrote:
> Distributed/replicated
> 
> Volume Name: homegfs
> Type: Distributed-Replicate
> Volume ID: 1e32672a-f1b7-4b58-ba94-58c085e59071
> Status: Started
> Number of Bricks: 4 x 2 = 8
> Transport-type: tcp
> Bricks:
> Brick1: gfsib01a.corvidtec.com:/data/brick01a/homegfs
> Brick2: gfsib01b.corvidtec.com:/data/brick01b/homegfs
> Brick3: gfsib01a.corvidtec.com:/data/brick02a/homegfs
> Brick4: gfsib01b.corvidtec.com:/data/brick02b/homegfs
> Brick5: gfsib02a.corvidtec.com:/data/brick01a/homegfs
> Brick6: gfsib02b.corvidtec.com:/data/brick01b/homegfs
> Brick7: gfsib02a.corvidtec.com:/data/brick02a/homegfs
> Brick8: gfsib02b.corvidtec.com:/data/brick02b/homegfs
> Options Reconfigured:
> performance.io-thread-count: 32
> performance.cache-size: 128MB
> performance.write-behind-window-size: 128MB
> server.allow-insecure: on
> network.ping-timeout: 10
> storage.owner-gid: 100
> geo-replication.indexing: off
> geo-replication.ignore-pid-check: on
> changelog.changelog: on
> changelog.fsync-interval: 3
> changelog.r

Re: [Gluster-devel] [Gluster-users] missing files

2015-02-05 Thread David F. Robinson
I'll send you the emails I sent Pranith with the logs. What causes these 
disconnects?

David  (Sent from mobile)

===
David F. Robinson, Ph.D. 
President - Corvid Technologies
704.799.6944 x101 [office]
704.252.1310  [cell]
704.799.7974  [fax]
david.robin...@corvidtec.com
http://www.corvidtechnologies.com

> On Feb 5, 2015, at 4:55 PM, Ben Turner  wrote:
> 
> - Original Message -
>> From: "Pranith Kumar Karampuri" 
>> To: "Xavier Hernandez" , "David F. Robinson" 
>> , "Benjamin Turner"
>> 
>> Cc: gluster-us...@gluster.org, "Gluster Devel" 
>> Sent: Thursday, February 5, 2015 5:30:04 AM
>> Subject: Re: [Gluster-users] [Gluster-devel] missing files
>> 
>> 
>>> On 02/05/2015 03:48 PM, Pranith Kumar Karampuri wrote:
>>> I believe David already fixed this. I hope this is the same permissions
>>> issue he told us about.
>> Oops, it is not. I will take a look.
> 
> Yes David exactly like these:
> 
> data-brick02a-homegfs.log:[2015-02-03 19:09:34.568842] I 
> [server.c:518:server_rpc_notify] 0-homegfs-server: disconnecting connection 
> from 
> gfs02a.corvidtec.com-18563-2015/02/03-19:07:58:519134-homegfs-client-2-0-0
> data-brick02a-homegfs.log:[2015-02-03 19:09:41.286551] I 
> [server.c:518:server_rpc_notify] 0-homegfs-server: disconnecting connection 
> from 
> gfs01a.corvidtec.com-12804-2015/02/03-19:09:38:497808-homegfs-client-2-0-0
> data-brick02a-homegfs.log:[2015-02-03 19:16:35.906412] I 
> [server.c:518:server_rpc_notify] 0-homegfs-server: disconnecting connection 
> from 
> gfs02b.corvidtec.com-27190-2015/02/03-19:15:53:458467-homegfs-client-2-0-0
> data-brick02a-homegfs.log:[2015-02-03 19:51:22.761293] I 
> [server.c:518:server_rpc_notify] 0-homegfs-server: disconnecting connection 
> from gfs01a.corvidtec.com-25926-2015/02/03-19:51:02:89070-homegfs-client-2-0-0
> data-brick02a-homegfs.log:[2015-02-03 20:54:02.772180] I 
> [server.c:518:server_rpc_notify] 0-homegfs-server: disconnecting connection 
> from gfs01b.corvidtec.com-4175-2015/02/02-16:44:31:179119-homegfs-client-2-0-1
> 
> You can 100% verify my theory if you can correlate the time of the 
> disconnects to the time that the missing files were healed.  Can you have a 
> look at /var/log/glusterfs/glustershd.log?  That has all of the healed files 
> + timestamps; if we can see a disconnect during the rsync and a self-heal of 
> the missing file, I think we can safely assume that the disconnects may have 
> caused this.  I'll try this on my test systems.  How much data did you rsync?  
> Roughly what size of files, and an idea of the dir layout?  
> 
> @Pranith - Could bricks flapping up and down during the rsync have caused the files 
> to be missing on the first ls (written to one subvol but not the other because it 
> was down), with that ls triggering SH, which is why the files were there for the 
> second ls?  Could that be a possible cause here?
> 
> -b
> 
> 
>> Pranith
>>> 
>>> Pranith
 On 02/05/2015 03:44 PM, Xavier Hernandez wrote:
 Is the failure repeatable ? with the same directories ?
 
 It's very weird that the directories appear on the volume when you do
 an 'ls' on the bricks. Could it be that you only made a single 'ls'
 on fuse mount which not showed the directory ? Is it possible that
 this 'ls' triggered a self-heal that repaired the problem, whatever
 it was, and when you did another 'ls' on the fuse mount after the
 'ls' on the bricks, the directories were there ?
 
 The first 'ls' could have healed the files, causing that the
 following 'ls' on the bricks showed the files as if nothing were
 damaged. If that's the case, it's possible that there were some
 disconnections during the copy.
 
 Added Pranith because he knows better replication and self-heal details.
 
 Xavi
 
> On 02/04/2015 07:23 PM, David F. Robinson wrote:
> Distributed/replicated
> 
> Volume Name: homegfs
> Type: Distributed-Replicate
> Volume ID: 1e32672a-f1b7-4b58-ba94-58c085e59071
> Status: Started
> Number of Bricks: 4 x 2 = 8
> Transport-type: tcp
> Bricks:
> Brick1: gfsib01a.corvidtec.com:/data/brick01a/homegfs
> Brick2: gfsib01b.corvidtec.com:/data/brick01b/homegfs
> Brick3: gfsib01a.corvidtec.com:/data/brick02a/homegfs
> Brick4: gfsib01b.corvidtec.com:/data/brick02b/homegfs
> Brick5: gfsib02a.corvidtec.com:/data/brick01a/homegfs
> Brick6: gfsib02b.corvidtec.com:/data/brick01b/homegfs
> Brick7: gfsib02a.corvidtec.com:/data/brick02a/homegfs
> Brick8: gfsib02b.corvidtec.com:/data/brick02b/homegfs
> Options Reconfigured:
> performance.io-thread-count: 32
> performance.cache-size: 128MB
> performance.write-behind-window-size: 128MB
> server.allow-insecure: on
> network.ping-timeout: 10
> storage.owner-gid: 100
> geo-replication.indexing: off
> geo-replication.ignore-pid-check: on
> changelog.changelog: on
> changelog.fsync-interval: 3
> changelog.rollover

Re: [Gluster-devel] [Gluster-users] missing files

2015-02-05 Thread Pranith Kumar Karampuri


On 02/05/2015 03:48 PM, Pranith Kumar Karampuri wrote:
I believe David already fixed this. I hope this is the same permissions
issue he told us about.

Oops, it is not. I will take a look.

Pranith


Pranith
On 02/05/2015 03:44 PM, Xavier Hernandez wrote:

Is the failure repeatable? With the same directories?

It's very weird that the directories appear on the volume when you do 
an 'ls' on the bricks. Could it be that you only made a single 'ls' 
on the fuse mount, which did not show the directory? Is it possible that 
this 'ls' triggered a self-heal that repaired the problem, whatever 
it was, and when you did another 'ls' on the fuse mount after the 
'ls' on the bricks, the directories were there?


The first 'ls' could have healed the files, causing the 
following 'ls' on the bricks to show the files as if nothing were 
damaged. If that's the case, it's possible that there were some 
disconnections during the copy.
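
A quick way to check whether self-heal still has pending work (a sketch,
using the volume name from this thread) is:

# entries still pending heal, listed per brick
gluster volume heal homegfs info

# entries self-heal could not resolve on its own
gluster volume heal homegfs info split-brain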


Added Pranith because he knows better replication and self-heal details.

Xavi

On 02/04/2015 07:23 PM, David F. Robinson wrote:

Distributed/replicated

Volume Name: homegfs
Type: Distributed-Replicate
Volume ID: 1e32672a-f1b7-4b58-ba94-58c085e59071
Status: Started
Number of Bricks: 4 x 2 = 8
Transport-type: tcp
Bricks:
Brick1: gfsib01a.corvidtec.com:/data/brick01a/homegfs
Brick2: gfsib01b.corvidtec.com:/data/brick01b/homegfs
Brick3: gfsib01a.corvidtec.com:/data/brick02a/homegfs
Brick4: gfsib01b.corvidtec.com:/data/brick02b/homegfs
Brick5: gfsib02a.corvidtec.com:/data/brick01a/homegfs
Brick6: gfsib02b.corvidtec.com:/data/brick01b/homegfs
Brick7: gfsib02a.corvidtec.com:/data/brick02a/homegfs
Brick8: gfsib02b.corvidtec.com:/data/brick02b/homegfs
Options Reconfigured:
performance.io-thread-count: 32
performance.cache-size: 128MB
performance.write-behind-window-size: 128MB
server.allow-insecure: on
network.ping-timeout: 10
storage.owner-gid: 100
geo-replication.indexing: off
geo-replication.ignore-pid-check: on
changelog.changelog: on
changelog.fsync-interval: 3
changelog.rollover-time: 15
server.manage-gids: on
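
Worth noting from the options above: network.ping-timeout is set to 10 seconds.
If the disconnects turn out to be ping timeouts on a busy brick, one experiment
(a sketch only, not a recommendation) would be to move it back toward the stock
default:

# the shipped default is 42 seconds; 10s makes a loaded brick much easier to time out
gluster volume set homegfs network.ping-timeout 42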


-- Original Message --
From: "Xavier Hernandez" 
To: "David F. Robinson" ; "Benjamin
Turner" 
Cc: "gluster-us...@gluster.org" ; "Gluster
Devel" 
Sent: 2/4/2015 6:03:45 AM
Subject: Re: [Gluster-devel] missing files


On 02/04/2015 01:30 AM, David F. Robinson wrote:

Sorry. Thought about this a little more. I should have been clearer.
The files were on both bricks of the replica, not just one side. So,
both bricks had to have been up... The files/directories just don't
show up on the mount.

I was reading and saw a related bug
(https://bugzilla.redhat.com/show_bug.cgi?id=1159484). I saw it
suggested to run:

 find  -d -exec getfattr -h -n trusted.ec.heal {} \;


This command is specific to dispersed volumes. It won't do anything
(aside from the error you are seeing) on a replicated volume.
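
(For a replicate volume, the closest on-brick check is the trusted.afr.* xattrs
rather than trusted.ec.heal; a sketch, assuming one of the brick paths from this
thread and a placeholder file path:)

# run on the brick server; dumps all xattrs, including the AFR changelog ones, for one file
getfattr -d -m . -e hex /data/brick01a/homegfs/path/to/file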

I think you are using a replicated volume, right ?

In this case I'm not sure what can be happening. Is your volume a pure
replicated one or a distributed-replicated one? On a pure replicated
volume it doesn't make sense that some entries do not show up in an 'ls'
when the file is on both replicas (at least not without any error message
in the logs). On a distributed-replicated volume it could be caused by
some problem while combining the contents of each replica set.

What's the configuration of your volume ?

Xavi



I get a bunch of errors for operation not supported:
[root@gfs02a homegfs]# find wks_backup -d -exec getfattr -h -n trusted.ec.heal {} \;
find: warning: the -d option is deprecated; please use -depth instead, because the latter is a POSIX-compliant feature.
wks_backup/homer_backup/backup: trusted.ec.heal: Operation not supported
wks_backup/homer_backup/logs/2014_05_20.log: trusted.ec.heal: Operation not supported
wks_backup/homer_backup/logs/2014_05_21.log: trusted.ec.heal: Operation not supported
wks_backup/homer_backup/logs/2014_05_18.log: trusted.ec.heal: Operation not supported
wks_backup/homer_backup/logs/2014_05_19.log: trusted.ec.heal: Operation not supported
wks_backup/homer_backup/logs/2014_05_22.log: trusted.ec.heal: Operation not supported
wks_backup/homer_backup/logs: trusted.ec.heal: Operation not supported
wks_backup/homer_backup: trusted.ec.heal: Operation not supported
-- Original Message --
From: "Benjamin Turner" mailto:bennytu...@gmail.com>>
To: "David F. Robinson" mailto:david.robin...@corvidtec.com>>
Cc: "Gluster Devel" mailto:gluster-devel@gluster.org>>; "gluster-us...@gluster.org"
mailto:gluster-us...@gluster.org>>
Sent: 2/3/2015 7:12:34 PM
Subject: Re: [Gluster-devel] missing files
It sounds to me like the files were only copied to one replica, weren't
there for the initial ls which triggered a self heal, and were there for
the last ls because they were healed. Is there any chance that one of the
replicas was down during the rsync? It could be that you lost a brick
during the copy or something like that. To confirm I would lo