Isn't rsync what geo-rep uses?
David (Sent from mobile)
===============================
David F. Robinson, Ph.D.
President - Corvid Technologies
704.799.6944 x101 [office]
704.252.1310 [cell]
704.799.7974 [fax]
david.robin...@corvidtec.com
http://www.corvidtechnologies.com
> On Feb 5, 2015, at 5:41 PM, Ben Turner <btur...@redhat.com> wrote:
>
> ----- Original Message -----
>> From: "Ben Turner" <btur...@redhat.com>
>> To: "David F. Robinson" <david.robin...@corvidtec.com>
>> Cc: "Pranith Kumar Karampuri" <pkara...@redhat.com>, "Xavier
Hernandez" <xhernan...@datalab.es>, "Benjamin Turner"
>> <bennytu...@gmail.com>, gluster-us...@gluster.org, "Gluster Devel"
<gluster-devel@gluster.org>
>> Sent: Thursday, February 5, 2015 5:22:26 PM
>> Subject: Re: [Gluster-users] [Gluster-devel] missing files
>>
>> ----- Original Message -----
>>> From: "David F. Robinson" <david.robin...@corvidtec.com>
>>> To: "Ben Turner" <btur...@redhat.com>
>>> Cc: "Pranith Kumar Karampuri" <pkara...@redhat.com>, "Xavier
Hernandez"
>>> <xhernan...@datalab.es>, "Benjamin Turner"
>>> <bennytu...@gmail.com>, gluster-us...@gluster.org, "Gluster Devel"
>>> <gluster-devel@gluster.org>
>>> Sent: Thursday, February 5, 2015 5:01:13 PM
>>> Subject: Re: [Gluster-users] [Gluster-devel] missing files
>>>
>>> I'll send you the emails I sent Pranith with the logs. What causes these disconnects?
>>
>> Thanks David! Disconnects happen when there are interruptions in communication between peers; normally a ping timeout happens.
>> It could be anything from a flaky NW to the system being too busy to respond to the pings. My initial take is more towards the latter, as rsync is
>> absolutely the worst use case for gluster - IIRC it writes in 4KB blocks. I try to keep my writes at least 64KB, as in my testing that is the smallest
>> block size I can write with before perf starts to really drop off. I'll try something similar in the lab.
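>>
>> (For illustration, a rough sketch of the kind of block-size comparison described above - the mount point /homegfs and the file names are assumptions, not from this thread; each command writes roughly 1GB and flushes it at the end:)
>>
>>     # ~1GB of 4KB writes vs ~1GB of 64KB writes through the fuse mount
>>     dd if=/dev/zero of=/homegfs/ddtest_4k bs=4k count=262144 conv=fsync
>>     dd if=/dev/zero of=/homegfs/ddtest_64k bs=64k count=16384 conv=fsync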
>
> Ok, I do think that the file being self healed is the RCA for what you were seeing. Let's look at one of the disconnects:
>
> data-brick02a-homegfs.log:[2015-02-03 20:54:02.772180] I [server.c:518:server_rpc_notify] 0-homegfs-server: disconnecting connection from gfs01b.corvidtec.com-4175-2015/02/02-16:44:31:179119-homegfs-client-2-0-1
>
> And in the glustershd.log from the gfs01b_glustershd.log file:
>
> [2015-02-03 20:55:48.001797] I [afr-self-heal-entry.c:554:afr_selfheal_entry_do] 0-homegfs-replicate-0: performing entry selfheal on 6c79a368-edaa-432b-bef9-ec690ab42448
> [2015-02-03 20:55:49.341996] I [afr-self-heal-common.c:476:afr_log_selfheal] 0-homegfs-replicate-0: Completed entry selfheal on 6c79a368-edaa-432b-bef9-ec690ab42448. source=1 sinks=0
> [2015-02-03 20:55:49.343093] I [afr-self-heal-entry.c:554:afr_selfheal_entry_do] 0-homegfs-replicate-0: performing entry selfheal on 792cb0d6-9290-4447-8cd7-2b2d7a116a69
> [2015-02-03 20:55:50.463652] I [afr-self-heal-common.c:476:afr_log_selfheal] 0-homegfs-replicate-0: Completed entry selfheal on 792cb0d6-9290-4447-8cd7-2b2d7a116a69. source=1 sinks=0
> [2015-02-03 20:55:51.465289] I [afr-self-heal-metadata.c:54:__afr_selfheal_metadata_do] 0-homegfs-replicate-0: performing metadata selfheal on 403e661a-1c27-4e79-9867-c0572aba2b3c
> [2015-02-03 20:55:51.466515] I [afr-self-heal-common.c:476:afr_log_selfheal] 0-homegfs-replicate-0: Completed metadata selfheal on 403e661a-1c27-4e79-9867-c0572aba2b3c. source=1 sinks=0
> [2015-02-03 20:55:51.467098] I [afr-self-heal-entry.c:554:afr_selfheal_entry_do] 0-homegfs-replicate-0: performing entry selfheal on 403e661a-1c27-4e79-9867-c0572aba2b3c
> [2015-02-03 20:55:55.257808] I [afr-self-heal-common.c:476:afr_log_selfheal] 0-homegfs-replicate-0: Completed entry selfheal on 403e661a-1c27-4e79-9867-c0572aba2b3c. source=1 sinks=0
> [2015-02-03 20:55:55.258548] I [afr-self-heal-metadata.c:54:__afr_selfheal_metadata_do] 0-homegfs-replicate-0: performing metadata selfheal on c612ee2f-2fb4-4157-a9ab-5a2d5603c541
> [2015-02-03 20:55:55.259367] I [afr-self-heal-common.c:476:afr_log_selfheal] 0-homegfs-replicate-0: Completed metadata selfheal on c612ee2f-2fb4-4157-a9ab-5a2d5603c541. source=1 sinks=0
> [2015-02-03 20:55:55.259980] I [afr-self-heal-entry.c:554:afr_selfheal_entry_do] 0-homegfs-replicate-0: performing entry selfheal on c612ee2f-2fb4-4157-a9ab-5a2d5603c541
>
> As you can see, the self heal logs are just spammed with files being healed, and I looked at a couple of disconnects and I see self heals getting run shortly after on the bricks that were down. Now we need to find the cause of the disconnects; I am thinking that once the disconnects are resolved, the files should be properly copied over without SH having to fix things. Like I said, I'll give this a go on my lab systems and see if I can repro the disconnects; I'll have time to run through it tomorrow. If in the meantime anyone else has a theory / anything to add here, it would be appreciated.
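>
> (A rough sketch of how a disconnect timeline could be pulled out of the brick logs to line up against the self heals - assuming the default log directory /var/log/glusterfs/bricks; adjust the path if your bricks log elsewhere:)
>
>     # list every disconnect the bricks logged, with timestamps
>     grep "disconnecting connection" /var/log/glusterfs/bricks/*.log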
>
> -b
>
>> -b
>>
>>> David (Sent from mobile)
>>>
>>> ===============================
>>> David F. Robinson, Ph.D.
>>> President - Corvid Technologies
>>> 704.799.6944 x101 [office]
>>> 704.252.1310 [cell]
>>> 704.799.7974 [fax]
>>> david.robin...@corvidtec.com
>>> http://www.corvidtechnologies.com
>>>
>>>> On Feb 5, 2015, at 4:55 PM, Ben Turner <btur...@redhat.com> wrote:
>>>>
>>>> ----- Original Message -----
>>>>> From: "Pranith Kumar Karampuri" <pkara...@redhat.com>
>>>>> To: "Xavier Hernandez" <xhernan...@datalab.es>, "David F.
Robinson"
>>>>> <david.robin...@corvidtec.com>, "Benjamin Turner"
>>>>> <bennytu...@gmail.com>
>>>>> Cc: gluster-us...@gluster.org, "Gluster Devel"
>>>>> <gluster-devel@gluster.org>
>>>>> Sent: Thursday, February 5, 2015 5:30:04 AM
>>>>> Subject: Re: [Gluster-users] [Gluster-devel] missing files
>>>>>
>>>>>
>>>>>> On 02/05/2015 03:48 PM, Pranith Kumar Karampuri wrote:
>>>>>> I believe David already fixed this. I hope this is the same issue he told us about - the permissions issue.
>>>>> Oops, it is not. I will take a look.
>>>>
>>>> Yes David, exactly like these:
>>>>
>>>> data-brick02a-homegfs.log:[2015-02-03 19:09:34.568842] I [server.c:518:server_rpc_notify] 0-homegfs-server: disconnecting connection from gfs02a.corvidtec.com-18563-2015/02/03-19:07:58:519134-homegfs-client-2-0-0
>>>> data-brick02a-homegfs.log:[2015-02-03 19:09:41.286551] I [server.c:518:server_rpc_notify] 0-homegfs-server: disconnecting connection from gfs01a.corvidtec.com-12804-2015/02/03-19:09:38:497808-homegfs-client-2-0-0
>>>> data-brick02a-homegfs.log:[2015-02-03 19:16:35.906412] I [server.c:518:server_rpc_notify] 0-homegfs-server: disconnecting connection from gfs02b.corvidtec.com-27190-2015/02/03-19:15:53:458467-homegfs-client-2-0-0
>>>> data-brick02a-homegfs.log:[2015-02-03 19:51:22.761293] I [server.c:518:server_rpc_notify] 0-homegfs-server: disconnecting connection from gfs01a.corvidtec.com-25926-2015/02/03-19:51:02:89070-homegfs-client-2-0-0
>>>> data-brick02a-homegfs.log:[2015-02-03 20:54:02.772180] I [server.c:518:server_rpc_notify] 0-homegfs-server: disconnecting connection from gfs01b.corvidtec.com-4175-2015/02/02-16:44:31:179119-homegfs-client-2-0-1
>>>>
>>>> You can 100% verify my theory if you can correlate the time of the disconnects to the time that the missing files were healed. Can you have a look at /var/log/glusterfs/glustershd.log? That has all of the healed files + timestamps; if we can see a disconnect during the rsync and a self heal of the missing file, I think we can safely assume that the disconnects may have caused this. I'll try this on my test systems - how much data did you rsync? What size-ish of files / an idea of the dir layout?
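>>>>
>>>> (As a sketch of what to look for - assuming the default glustershd.log location; the pattern matches the "Completed ... selfheal" lines that carry the heal timestamps:)
>>>>
>>>>     # timestamps of completed self heals, to compare against the disconnect times
>>>>     grep "Completed" /var/log/glusterfs/glustershd.log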
>>>>
>>>> @Pranith - Could bricks flapping up and down during the rsync be a possible cause here? That is: the files were missing on the first ls because they were written to 1 subvol but not the other (which was down), the ls triggered SH, and that's why the files were there for the second ls.
>>>>
>>>> -b
>>>>
>>>>
>>>>> Pranith
>>>>>>
>>>>>> Pranith
>>>>>>> On 02/05/2015 03:44 PM, Xavier Hernandez wrote:
>>>>>>> Is the failure repeatable? With the same directories?
>>>>>>>
>>>>>>> It's very weird that the directories appear on the volume when you do an 'ls' on the bricks. Could it be that you only made a single 'ls' on the fuse mount which did not show the directory? Is it possible that this 'ls' triggered a self-heal that repaired the problem, whatever it was, and when you did another 'ls' on the fuse mount after the 'ls' on the bricks, the directories were there?
>>>>>>>
>>>>>>> The first 'ls' could have healed the files, causing the following 'ls' on the bricks to show the files as if nothing were damaged. If that's the case, it's possible that there were some disconnections during the copy.
>>>>>>>
>>>>>>> Added Pranith because he knows the replication and self-heal details better.
>>>>>>>
>>>>>>> Xavi
>>>>>>>
>>>>>>>> On 02/04/2015 07:23 PM, David F. Robinson wrote:
>>>>>>>> Distributed/replicated
>>>>>>>>
>>>>>>>> Volume Name: homegfs
>>>>>>>> Type: Distributed-Replicate
>>>>>>>> Volume ID: 1e32672a-f1b7-4b58-ba94-58c085e59071
>>>>>>>> Status: Started
>>>>>>>> Number of Bricks: 4 x 2 = 8
>>>>>>>> Transport-type: tcp
>>>>>>>> Bricks:
>>>>>>>> Brick1: gfsib01a.corvidtec.com:/data/brick01a/homegfs
>>>>>>>> Brick2: gfsib01b.corvidtec.com:/data/brick01b/homegfs
>>>>>>>> Brick3: gfsib01a.corvidtec.com:/data/brick02a/homegfs
>>>>>>>> Brick4: gfsib01b.corvidtec.com:/data/brick02b/homegfs
>>>>>>>> Brick5: gfsib02a.corvidtec.com:/data/brick01a/homegfs
>>>>>>>> Brick6: gfsib02b.corvidtec.com:/data/brick01b/homegfs
>>>>>>>> Brick7: gfsib02a.corvidtec.com:/data/brick02a/homegfs
>>>>>>>> Brick8: gfsib02b.corvidtec.com:/data/brick02b/homegfs
>>>>>>>> Options Reconfigured:
>>>>>>>> performance.io-thread-count: 32
>>>>>>>> performance.cache-size: 128MB
>>>>>>>> performance.write-behind-window-size: 128MB
>>>>>>>> server.allow-insecure: on
>>>>>>>> network.ping-timeout: 10
>>>>>>>> storage.owner-gid: 100
>>>>>>>> geo-replication.indexing: off
>>>>>>>> geo-replication.ignore-pid-check: on
>>>>>>>> changelog.changelog: on
>>>>>>>> changelog.fsync-interval: 3
>>>>>>>> changelog.rollover-time: 15
>>>>>>>> server.manage-gids: on
>>>>>>>>
>>>>>>>>
>>>>>>>> ------ Original Message ------
>>>>>>>> From: "Xavier Hernandez" <xhernan...@datalab.es>
>>>>>>>> To: "David F. Robinson" <david.robin...@corvidtec.com>;
"Benjamin
>>>>>>>> Turner" <bennytu...@gmail.com>
>>>>>>>> Cc: "gluster-us...@gluster.org" <gluster-us...@gluster.org>;
"Gluster
>>>>>>>> Devel" <gluster-devel@gluster.org>
>>>>>>>> Sent: 2/4/2015 6:03:45 AM
>>>>>>>> Subject: Re: [Gluster-devel] missing files
>>>>>>>>
>>>>>>>>>> On 02/04/2015 01:30 AM, David F. Robinson wrote:
>>>>>>>>>> Sorry. Thought about this a little more. I should have been clearer. The files were on both bricks of the replica, not just one side. So, both bricks had to have been up... The files/directories just don't show up on the mount.
>>>>>>>>>> I was reading and saw a related bug (https://bugzilla.redhat.com/show_bug.cgi?id=1159484). I saw it suggested to run:
>>>>>>>>>> find <mount> -d -exec getfattr -h -n trusted.ec.heal {} \;
>>>>>>>>>
>>>>>>>>> This command is specific to a dispersed volume. It won't do anything (aside from the error you are seeing) on a replicated volume.
>>>>>>>>>
>>>>>>>>> I think you are using a replicated volume, right?
>>>>>>>>>
>>>>>>>>> In this case I'm not sure what can be happening. Is your volume a pure replicated one or a distributed-replicated? On a pure replicated it doesn't make sense that some entries do not show in an 'ls' when the file is in both replicas (at least without any error message in the logs). On a distributed-replicated it could be caused by some problem while combining contents of each replica set.
>>>>>>>>>
>>>>>>>>> What's the configuration of your volume?
>>>>>>>>>
>>>>>>>>> Xavi
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> I get a bunch of errors for operation not supported:
>>>>>>>>>> [root@gfs02a homegfs]# find wks_backup -d -exec getfattr -h -n trusted.ec.heal {} \;
>>>>>>>>>> find: warning: the -d option is deprecated; please use -depth instead, because the latter is a POSIX-compliant feature.
>>>>>>>>>> wks_backup/homer_backup/backup: trusted.ec.heal: Operation not supported
>>>>>>>>>> wks_backup/homer_backup/logs/2014_05_20.log: trusted.ec.heal: Operation not supported
>>>>>>>>>> wks_backup/homer_backup/logs/2014_05_21.log: trusted.ec.heal: Operation not supported
>>>>>>>>>> wks_backup/homer_backup/logs/2014_05_18.log: trusted.ec.heal: Operation not supported
>>>>>>>>>> wks_backup/homer_backup/logs/2014_05_19.log: trusted.ec.heal: Operation not supported
>>>>>>>>>> wks_backup/homer_backup/logs/2014_05_22.log: trusted.ec.heal: Operation not supported
>>>>>>>>>> wks_backup/homer_backup/logs: trusted.ec.heal: Operation not supported
>>>>>>>>>> wks_backup/homer_backup: trusted.ec.heal: Operation not supported
>>>>>>>>>> ------ Original Message ------
>>>>>>>>>> ------ Original Message ------
>>>>>>>>>> From: "Benjamin Turner" <bennytu...@gmail.com
>>>>>>>>>> <mailto:bennytu...@gmail.com>>
>>>>>>>>>> To: "David F. Robinson" <david.robin...@corvidtec.com
>>>>>>>>>> <mailto:david.robin...@corvidtec.com>>
>>>>>>>>>> Cc: "Gluster Devel" <gluster-devel@gluster.org
>>>>>>>>>> <mailto:gluster-devel@gluster.org>>;
"gluster-us...@gluster.org"
>>>>>>>>>> <gluster-us...@gluster.org
<mailto:gluster-us...@gluster.org>>
>>>>>>>>>> Sent: 2/3/2015 7:12:34 PM
>>>>>>>>>> Subject: Re: [Gluster-devel] missing files
>>>>>>>>>>> It sounds to me like the files were only copied to one replica, weren't there for the initial ls which triggered a self heal, and were there for the last ls because they were healed. Is there any chance that one of the replicas was down during the rsync? It could be that you lost a brick during the copy or something like that. To confirm I would look for disconnects in the brick logs as well as checking glustershd.log to verify the missing files were actually healed.
>>>>>>>>>>>
>>>>>>>>>>> -b
>>>>>>>>>>>
>>>>>>>>>>> On Tue, Feb 3, 2015 at 5:37 PM, David F. Robinson <david.robin...@corvidtec.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> I rsync'd 20-TB over to my gluster system and noticed that I had some directories missing even though the rsync completed normally. The rsync logs showed that the missing files were transferred.
>>>>>>>>>>> I went to the bricks and did an 'ls -al /data/brick*/homegfs/dir/*' and the files were on the bricks. After I did this 'ls', the files then showed up on the FUSE mounts.
>>>>>>>>>>> 1) Why are the files hidden on the fuse mount?
>>>>>>>>>>> 2) Why does the ls make them show up on the FUSE mount?
>>>>>>>>>>> 3) How can I prevent this from happening again?
>>>>>>>>>>> Note, I also mounted the gluster volume using NFS and saw the same behavior. The files/directories were not shown until I did the "ls" on the bricks.
>>>>>>>>>>> David
>>>>>>>>>>> ===============================
>>>>>>>>>>> David F. Robinson, Ph.D.
>>>>>>>>>>> President - Corvid Technologies
>>>>>>>>>>> 704.799.6944 x101 [office]
>>>>>>>>>>> 704.252.1310 [cell]
>>>>>>>>>>> 704.799.7974 [fax]
>>>>>>>>>>> david.robin...@corvidtec.com
>>>>>>>>>>> http://www.corvidtechnologies.com
>>>>>>>>>>>
>>>>>>>>>>> _______________________________________________
>>>>>>>>>>> Gluster-devel mailing list
>>>>>>>>>>> Gluster-devel@gluster.org
>>>>>>>>>>> http://www.gluster.org/mailman/listinfo/gluster-devel
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> _______________________________________________
>>>>>>>>>> Gluster-devel mailing list
>>>>>>>>>> Gluster-devel@gluster.org
>>>>>>>>>> http://www.gluster.org/mailman/listinfo/gluster-devel
>>>>>>
>>>>>> _______________________________________________
>>>>>> Gluster-users mailing list
>>>>>> gluster-us...@gluster.org
>>>>>> http://www.gluster.org/mailman/listinfo/gluster-users
>>>>>
>>>>> _______________________________________________
>>>>> Gluster-users mailing list
>>>>> gluster-us...@gluster.org
>>>>> http://www.gluster.org/mailman/listinfo/gluster-users
>>