Re: [Gluster-devel] missing files

2015-02-05 Thread Xavier Hernandez

Is the failure repeatable? With the same directories?

It's very weird that the directories appear on the volume after you do an
'ls' on the bricks. Could it be that you only did a single 'ls' on the fuse
mount, and that it did not show the directory? Is it possible that this 'ls'
triggered a self-heal that repaired the problem, whatever it was, and that
when you did another 'ls' on the fuse mount after the 'ls' on the bricks,
the directories were there?


The first 'ls' could have healed the files, so the following 'ls' on the
bricks showed the files as if nothing had been damaged. If that's the case,
it's possible that there were some disconnections during the copy.


Adding Pranith because he knows the replication and self-heal details better.

Xavi

On 02/04/2015 07:23 PM, David F. Robinson wrote:

Distributed/replicated

Volume Name: homegfs
Type: Distributed-Replicate
Volume ID: 1e32672a-f1b7-4b58-ba94-58c085e59071
Status: Started
Number of Bricks: 4 x 2 = 8
Transport-type: tcp
Bricks:
Brick1: gfsib01a.corvidtec.com:/data/brick01a/homegfs
Brick2: gfsib01b.corvidtec.com:/data/brick01b/homegfs
Brick3: gfsib01a.corvidtec.com:/data/brick02a/homegfs
Brick4: gfsib01b.corvidtec.com:/data/brick02b/homegfs
Brick5: gfsib02a.corvidtec.com:/data/brick01a/homegfs
Brick6: gfsib02b.corvidtec.com:/data/brick01b/homegfs
Brick7: gfsib02a.corvidtec.com:/data/brick02a/homegfs
Brick8: gfsib02b.corvidtec.com:/data/brick02b/homegfs
Options Reconfigured:
performance.io-thread-count: 32
performance.cache-size: 128MB
performance.write-behind-window-size: 128MB
server.allow-insecure: on
network.ping-timeout: 10
storage.owner-gid: 100
geo-replication.indexing: off
geo-replication.ignore-pid-check: on
changelog.changelog: on
changelog.fsync-interval: 3
changelog.rollover-time: 15
server.manage-gids: on


-- Original Message --
From: "Xavier Hernandez" 
To: "David F. Robinson" ; "Benjamin
Turner" 
Cc: "gluster-us...@gluster.org" ; "Gluster
Devel" 
Sent: 2/4/2015 6:03:45 AM
Subject: Re: [Gluster-devel] missing files


On 02/04/2015 01:30 AM, David F. Robinson wrote:

Sorry. Thought about this a little more. I should have been clearer.
The files were on both bricks of the replica, not just one side. So,
both bricks had to have been up... The files/directories just don't show
up on the mount.
I was reading and saw a related bug
(https://bugzilla.redhat.com/show_bug.cgi?id=1159484). I saw it
suggested to run:
 find  -d -exec getfattr -h -n trusted.ec.heal {} \;


This command is specific to a dispersed volume. It won't do anything
(aside from the error you are seeing) on a replicated volume.
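
On a replicated volume the equivalent state lives in the trusted.afr.*
extended attributes on the bricks. Something like this (run directly on a
brick; the brick path comes from your volume info and the file path is only
an example) shows them:

    getfattr -d -m trusted.afr -e hex /data/brick01a/homegfs/path/to/file

Non-zero pending counters there usually indicate entries that still need
healing.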

I think you are using a replicated volume, right?

In that case I'm not sure what can be happening. Is your volume a pure
replicated one or a distributed-replicated one? On a pure replicated volume
it doesn't make sense that some entries do not show up in an 'ls' when the
file is on both replicas (at least not without any error message in the
logs). On a distributed-replicated volume it could be caused by some problem
while combining the contents of each replica set.

What's the configuration of your volume?

Xavi



I get a bunch of errors for operation not supported:
[root@gfs02a homegfs]# find wks_backup -d -exec getfattr -h -n
trusted.ec.heal {} \;
find: warning: the -d option is deprecated; please use -depth instead,
because the latter is a POSIX-compliant feature.
wks_backup/homer_backup/backup: trusted.ec.heal: Operation not supported
wks_backup/homer_backup/logs/2014_05_20.log: trusted.ec.heal: Operation
not supported
wks_backup/homer_backup/logs/2014_05_21.log: trusted.ec.heal: Operation
not supported
wks_backup/homer_backup/logs/2014_05_18.log: trusted.ec.heal: Operation
not supported
wks_backup/homer_backup/logs/2014_05_19.log: trusted.ec.heal: Operation
not supported
wks_backup/homer_backup/logs/2014_05_22.log: trusted.ec.heal: Operation
not supported
wks_backup/homer_backup/logs: trusted.ec.heal: Operation not supported
wks_backup/homer_backup: trusted.ec.heal: Operation not supported
-- Original Message --
From: "Benjamin Turner" <bennytu...@gmail.com>
To: "David F. Robinson" <david.robin...@corvidtec.com>
Cc: "Gluster Devel" <gluster-devel@gluster.org>; "gluster-us...@gluster.org"
<gluster-us...@gluster.org>
Sent: 2/3/2015 7:12:34 PM
Subject: Re: [Gluster-devel] missing files

It sounds to me like the files were only copied to one replica, weren't
there for the initial ls which triggered a self heal,
and were there for the last ls because they were healed. Is there any
chance that one of the replicas was down during the rsync? It could
be that you lost a brick during the copy or something like that. To
confirm, I would look for disconnects in the brick logs as well as
check glustershd.log to verify that the missing files were actually
healed.
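
For example, something along these lines (log paths assumed to be the
defaults; adjust the brick log names to your setup):

    # client disconnects seen by each brick
    grep 'disconnecting connection' /var/log/glusterfs/bricks/*.log

    # entries the self-heal daemon repaired, with timestamps
    grep 'Completed .* selfheal' /var/log/glusterfs/glustershd.log

If heal completions show up shortly after a disconnect, that would support
this theory.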

-b

On Tue, Feb 3, 2015 at 5:37 PM, David F. Robinson
mailto:david.robin...@corvidtec.com>>
wrote:

I rsync'd 20-TB over to my gluster system and noticed 

Re: [Gluster-devel] missing files

2015-02-05 Thread Pranith Kumar Karampuri
I believe David already fixed this. I hope this is the same issue he
mentioned, the permissions issue.


Pranith
On 02/05/2015 03:44 PM, Xavier Hernandez wrote:

Is the failure repeatable? With the same directories?

It's very weird that the directories appear on the volume after you do an
'ls' on the bricks. Could it be that you only did a single 'ls' on the fuse
mount, and that it did not show the directory? Is it possible that this 'ls'
triggered a self-heal that repaired the problem, whatever it was, and that
when you did another 'ls' on the fuse mount after the 'ls' on the bricks,
the directories were there?


The first 'ls' could have healed the files, so the following 'ls' on the
bricks showed the files as if nothing had been damaged. If that's the case,
it's possible that there were some disconnections during the copy.


Adding Pranith because he knows the replication and self-heal details better.

Xavi

On 02/04/2015 07:23 PM, David F. Robinson wrote:

Distributed/replicated

Volume Name: homegfs
Type: Distributed-Replicate
Volume ID: 1e32672a-f1b7-4b58-ba94-58c085e59071
Status: Started
Number of Bricks: 4 x 2 = 8
Transport-type: tcp
Bricks:
Brick1: gfsib01a.corvidtec.com:/data/brick01a/homegfs
Brick2: gfsib01b.corvidtec.com:/data/brick01b/homegfs
Brick3: gfsib01a.corvidtec.com:/data/brick02a/homegfs
Brick4: gfsib01b.corvidtec.com:/data/brick02b/homegfs
Brick5: gfsib02a.corvidtec.com:/data/brick01a/homegfs
Brick6: gfsib02b.corvidtec.com:/data/brick01b/homegfs
Brick7: gfsib02a.corvidtec.com:/data/brick02a/homegfs
Brick8: gfsib02b.corvidtec.com:/data/brick02b/homegfs
Options Reconfigured:
performance.io-thread-count: 32
performance.cache-size: 128MB
performance.write-behind-window-size: 128MB
server.allow-insecure: on
network.ping-timeout: 10
storage.owner-gid: 100
geo-replication.indexing: off
geo-replication.ignore-pid-check: on
changelog.changelog: on
changelog.fsync-interval: 3
changelog.rollover-time: 15
server.manage-gids: on


-- Original Message --
From: "Xavier Hernandez" 
To: "David F. Robinson" ; "Benjamin
Turner" 
Cc: "gluster-us...@gluster.org" ; "Gluster
Devel" 
Sent: 2/4/2015 6:03:45 AM
Subject: Re: [Gluster-devel] missing files


On 02/04/2015 01:30 AM, David F. Robinson wrote:

Sorry. Thought about this a little more. I should have been clearer.
The files were on both bricks of the replica, not just one side. So,
both bricks had to have been up... The files/directories just don't 
show

up on the mount.
I was reading and saw a related bug
(https://bugzilla.redhat.com/show_bug.cgi?id=1159484). I saw it
suggested to run:
 find  -d -exec getfattr -h -n trusted.ec.heal {} \;


This command is specific to a dispersed volume. It won't do anything
(aside from the error you are seeing) on a replicated volume.

I think you are using a replicated volume, right?

In that case I'm not sure what can be happening. Is your volume a pure
replicated one or a distributed-replicated one? On a pure replicated volume
it doesn't make sense that some entries do not show up in an 'ls' when the
file is on both replicas (at least not without any error message in the
logs). On a distributed-replicated volume it could be caused by some problem
while combining the contents of each replica set.

What's the configuration of your volume?

Xavi



I get a bunch of errors for operation not supported:
[root@gfs02a homegfs]# find wks_backup -d -exec getfattr -h -n
trusted.ec.heal {} \;
find: warning: the -d option is deprecated; please use -depth instead,
because the latter is a POSIX-compliant feature.
wks_backup/homer_backup/backup: trusted.ec.heal: Operation not 
supported
wks_backup/homer_backup/logs/2014_05_20.log: trusted.ec.heal: 
Operation

not supported
wks_backup/homer_backup/logs/2014_05_21.log: trusted.ec.heal: 
Operation

not supported
wks_backup/homer_backup/logs/2014_05_18.log: trusted.ec.heal: 
Operation

not supported
wks_backup/homer_backup/logs/2014_05_19.log: trusted.ec.heal: 
Operation

not supported
wks_backup/homer_backup/logs/2014_05_22.log: trusted.ec.heal: 
Operation

not supported
wks_backup/homer_backup/logs: trusted.ec.heal: Operation not supported
wks_backup/homer_backup: trusted.ec.heal: Operation not supported
-- Original Message --
From: "Benjamin Turner" <bennytu...@gmail.com>
To: "David F. Robinson" <david.robin...@corvidtec.com>
Cc: "Gluster Devel" <gluster-devel@gluster.org>; "gluster-us...@gluster.org"
<gluster-us...@gluster.org>
Sent: 2/3/2015 7:12:34 PM
Subject: Re: [Gluster-devel] missing files
It sounds to me like the files were only copied to one replica, weren't
there for the initial ls which triggered a self heal,
and were there for the last ls because they were healed. Is there any
chance that one of the replicas was down during the rsync? It could
be that you lost a brick during copy or something like that. To
confirm I would look for disconnects in the brick logs as well as
checking glusterfshd.log to verify the missing files were actu

Re: [Gluster-devel] [Gluster-users] missing files

2015-02-05 Thread Pranith Kumar Karampuri


On 02/05/2015 03:48 PM, Pranith Kumar Karampuri wrote:
I believe David already fixed this. I hope this is the same issue he
mentioned, the permissions issue.

Oops, it is not. I will take a look.

Pranith


Pranith
On 02/05/2015 03:44 PM, Xavier Hernandez wrote:

Is the failure repeatable? With the same directories?

It's very weird that the directories appear on the volume after you do an
'ls' on the bricks. Could it be that you only did a single 'ls' on the fuse
mount, and that it did not show the directory? Is it possible that this 'ls'
triggered a self-heal that repaired the problem, whatever it was, and that
when you did another 'ls' on the fuse mount after the 'ls' on the bricks,
the directories were there?


The first 'ls' could have healed the files, so the following 'ls' on the
bricks showed the files as if nothing had been damaged. If that's the case,
it's possible that there were some disconnections during the copy.


Adding Pranith because he knows the replication and self-heal details better.

Xavi

On 02/04/2015 07:23 PM, David F. Robinson wrote:

Distributed/replicated

Volume Name: homegfs
Type: Distributed-Replicate
Volume ID: 1e32672a-f1b7-4b58-ba94-58c085e59071
Status: Started
Number of Bricks: 4 x 2 = 8
Transport-type: tcp
Bricks:
Brick1: gfsib01a.corvidtec.com:/data/brick01a/homegfs
Brick2: gfsib01b.corvidtec.com:/data/brick01b/homegfs
Brick3: gfsib01a.corvidtec.com:/data/brick02a/homegfs
Brick4: gfsib01b.corvidtec.com:/data/brick02b/homegfs
Brick5: gfsib02a.corvidtec.com:/data/brick01a/homegfs
Brick6: gfsib02b.corvidtec.com:/data/brick01b/homegfs
Brick7: gfsib02a.corvidtec.com:/data/brick02a/homegfs
Brick8: gfsib02b.corvidtec.com:/data/brick02b/homegfs
Options Reconfigured:
performance.io-thread-count: 32
performance.cache-size: 128MB
performance.write-behind-window-size: 128MB
server.allow-insecure: on
network.ping-timeout: 10
storage.owner-gid: 100
geo-replication.indexing: off
geo-replication.ignore-pid-check: on
changelog.changelog: on
changelog.fsync-interval: 3
changelog.rollover-time: 15
server.manage-gids: on


-- Original Message --
From: "Xavier Hernandez" 
To: "David F. Robinson" ; "Benjamin
Turner" 
Cc: "gluster-us...@gluster.org" ; "Gluster
Devel" 
Sent: 2/4/2015 6:03:45 AM
Subject: Re: [Gluster-devel] missing files


On 02/04/2015 01:30 AM, David F. Robinson wrote:

Sorry. Thought about this a little more. I should have been clearer.
The files were on both bricks of the replica, not just one side. So,
both bricks had to have been up... The files/directories just 
don't show

up on the mount.
I was reading and saw a related bug
(https://bugzilla.redhat.com/show_bug.cgi?id=1159484). I saw it
suggested to run:
 find  -d -exec getfattr -h -n trusted.ec.heal {} \;


This command is specific to a dispersed volume. It won't do anything
(aside from the error you are seeing) on a replicated volume.

I think you are using a replicated volume, right?

In that case I'm not sure what can be happening. Is your volume a pure
replicated one or a distributed-replicated one? On a pure replicated volume
it doesn't make sense that some entries do not show up in an 'ls' when the
file is on both replicas (at least not without any error message in the
logs). On a distributed-replicated volume it could be caused by some problem
while combining the contents of each replica set.

What's the configuration of your volume?

Xavi



I get a bunch of errors for operation not supported:
[root@gfs02a homegfs]# find wks_backup -d -exec getfattr -h -n
trusted.ec.heal {} \;
find: warning: the -d option is deprecated; please use -depth 
instead,

because the latter is a POSIX-compliant feature.
wks_backup/homer_backup/backup: trusted.ec.heal: Operation not 
supported
wks_backup/homer_backup/logs/2014_05_20.log: trusted.ec.heal: 
Operation

not supported
wks_backup/homer_backup/logs/2014_05_21.log: trusted.ec.heal: 
Operation

not supported
wks_backup/homer_backup/logs/2014_05_18.log: trusted.ec.heal: 
Operation

not supported
wks_backup/homer_backup/logs/2014_05_19.log: trusted.ec.heal: 
Operation

not supported
wks_backup/homer_backup/logs/2014_05_22.log: trusted.ec.heal: 
Operation

not supported
wks_backup/homer_backup/logs: trusted.ec.heal: Operation not 
supported

wks_backup/homer_backup: trusted.ec.heal: Operation not supported
-- Original Message --
From: "Benjamin Turner" <bennytu...@gmail.com>
To: "David F. Robinson" <david.robin...@corvidtec.com>
Cc: "Gluster Devel" <gluster-devel@gluster.org>; "gluster-us...@gluster.org"
<gluster-us...@gluster.org>
Sent: 2/3/2015 7:12:34 PM
Subject: Re: [Gluster-devel] missing files
It sounds to me like the files were only copied to one replica, weren't
there for the initial ls which triggered a self heal,
and were there for the last ls because they were healed. Is there 
any

chance that one of the replicas was down during the rsync? It could
be that you lost a brick during copy or something like that. To
confirm I would lo

Re: [Gluster-devel] [Gluster-users] Input/Output Error on Gluster NFS

2015-02-05 Thread Peter Auyeung
Hi Soumya

root@glusterprod001:~# gluster volume info | grep nfs.acl
02/05/15 10:00:05 [ /root ]

Seems like we do not have ACL enabled.

nfs client is a RHEL4 standard NFS client

Thanks
-Peter

From: Soumya Koduri [skod...@redhat.com]
Sent: Wednesday, February 04, 2015 11:28 PM
To: Peter Auyeung; gluster-us...@gluster.org; gluster-devel@gluster.org
Subject: Re: [Gluster-devel] [Gluster-users] Input/Output Error on Gluster NFS

Hi Peter,

Have you disabled Gluster-NFS ACLs?

Please check the option value -
#gluster v info | grep nfs.acl
nfs.acl: ON

Also, please provide the details of the nfs-client you are using.
Typically, nfs-clients issue a getxattr before doing
setxattr/removexattr operations, and that returns 'ENOTSUPP' when ACLs are
disabled. But from the strace, it looks like the client issued a
'removexattr' of 'system.posix_acl_default', which returned EIO.

In any case, 'removexattr' should also have returned EOPNOTSUPP instead of EIO.
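
If you want to rule out client-side ACL handling, one quick test (mount
options and names are illustrative, assuming your client supports them) is to
remount with ACLs disabled on the client and retry the copy:

    mount -t nfs -o vers=3,noacl glusterprod001:/<volname> /mnt

Alternatively, ACL support can be turned off on the Gluster-NFS side with
something like:

    gluster volume set <volname> nfs.acl off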

Thanks,
Soumya

On 02/05/2015 02:31 AM, Peter Auyeung wrote:
> I was trying to copy a directory of files to Gluster via NFS and getting
> permission denied with Input/Output error
>
> ---> r...@bizratedbstandby.bo2.shopzilla.sea (0.00)# cp -pr db /mnt/
> cp: setting permissions for 
> `/mnt/db/full/pr_bizrate_standby_SMLS.F02-01-22-35.d': Input/output error
> cp: setting permissions for 
> `/mnt/db/full/pr_bizrate_standby_logging.F02-02-18-10.b': Input/output error
> cp: setting permissions for `/mnt/db/full/pr_bizrate_SMLS.F02-01-22-35.d': 
> Input/output error
> cp: setting permissions for 
> `/mnt/db/full/pr_bizrate_standby_master.F02-02-22-00': Input/output error
> cp: setting permissions for `/mnt/db/full': Input/output error
> cp: setting permissions for `/mnt/db': Input/output error
>
> I checked the gluster nfs.log, the etc log, and the brick logs, and they look clean.
> The files end up getting copied over with the right permissions.
>
> I straced the copy and it seems like it failed on removexattr
>
> removexattr("/mnt/db", "system.posix_acl_default"...) = -1 EIO (Input/output 
> error)
>
> http://pastie.org/9884810
>
> Any Clue?
>
> Thanks
> Peter
>
>
>
>
>
>
>
> ___
> Gluster-devel mailing list
> Gluster-devel@gluster.org
> http://www.gluster.org/mailman/listinfo/gluster-devel
>
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel


[Gluster-devel] REMINDER: GlusterFS.next (a.k.a. 4.0) status/planning meeting

2015-02-05 Thread Jeff Darcy
This is *tomorrow* at 12:00 UTC (approximately 15.5 hours from now) in
#gluster-meeting on Freenode.  See you all there!

- Original Message -
> Perhaps it's not obvious to the broader community, but a bunch of people
> have put a bunch of work into various projects under the "4.0" banner.
> Some of the results can be seen in the various feature pages here:
> 
> http://www.gluster.org/community/documentation/index.php/Planning40
> 
> Now that the various subproject feature pages have been updated, it's
> time to get people together and decide what 4.0 is *really* going to be.
> To that end, I'd like to schedule an IRC meeting for February 6 at 12:00
> UTC - that's this Friday, same time as the triage/community meetings but
> on Friday instead of Tuesday/Wednesday.  An initial agenda includes:
> 
> * Introduction and expectation-setting
> 
> * Project-by-project status and planning
> 
> * Discussion of future meeting formats and times
> 
> * Discussion of collaboration tools (e.g. gluster.org wiki or
>   Freedcamp) going forward.
> 
> Anyone with an interest in the future of GlusterFS is welcome to attend.
> This is *not* a Red Hat only effort, tied to Red Hat product needs and
> schedules and strategies.  This is a chance for the community to come
> together and define what the next generation of "distributed file
> systems for the real world" will look like.  I hope to see everyone
> there.
> 
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel


Re: [Gluster-devel] failed heal

2015-02-05 Thread Niels de Vos
On Thu, Feb 05, 2015 at 11:21:58AM +0530, Pranith Kumar Karampuri wrote:
> 
> On 02/04/2015 11:52 PM, David F. Robinson wrote:
> >I don't recall if that was before or after my upgrade.
> >I'll forward you an email thread for the current heal issues which are
> >after the 3.6.2 upgrade...
> This was executed after the upgrade on just one machine. 3.6.2 entry locks
> are not compatible with versions <= 3.5.3 and 3.6.1; that is the reason. From
> 3.5.4 onward and releases >= 3.6.2 it should work fine.

Oh, I was not aware of this requirement. Does it mean we should not mix
deployments with these versions (what about 3.4?) any longer? 3.5.4 has
not been released yet, so anyone with a mixed 3.5/3.6.2 environment will
hit these issues? Is this only for the self-heal daemon, or are the
triggered/stat self-heal procedures affected too?

It should be noted *very* clearly in the release notes, and I think an
announcement (email+blog) as a warning/reminder would be good. Could you
get some details and advice written down, please?
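
In the meantime, a quick way for admins to audit what each peer is actually
running (host list purely illustrative) would be something like:

    for h in server1 server2 server3 server4; do
        ssh "$h" 'glusterfs --version | head -n1'
    done

so that mixed 3.5.x/3.6.x deployments can at least be spotted before they hit
this.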

Thanks,
Niels


> 
> Pranith
> >David
> >-- Original Message --
> >From: "Pranith Kumar Karampuri"  >>
> >To: "David F. Robinson"  >>; "gluster-us...@gluster.org"
> >mailto:gluster-us...@gluster.org>>; "Gluster
> >Devel" mailto:gluster-devel@gluster.org>>
> >Sent: 2/4/2015 2:33:20 AM
> >Subject: Re: [Gluster-devel] failed heal
> >>
> >>On 02/02/2015 03:34 AM, David F. Robinson wrote:
> >>>I have several files that gluster says it cannot heal. I deleted the
> >>>files from all of the bricks
> >>>(/data/brick0*/hpc_shared/motorsports/gmics/Raven/p3/*) and ran a full
> >>>heal using 'gluster volume heal homegfs full'.  Even after the full
> >>>heal, the entries below still show up.
> >>>How do I clear these?
> >>3.6.1 Had an issue where files undergoing I/O will also be shown in the
> >>output of 'gluster volume heal  info', we addressed that in
> >>3.6.2. Is this output from 3.6.1 by any chance?
> >>
> >>Pranith
> >>>[root@gfs01a ~]# gluster volume heal homegfs info
> >>>Gathering list of entries to be healed on volume homegfs has been
> >>>successful
> >>>Brick gfsib01a.corvidtec.com:/data/brick01a/homegfs
> >>>Number of entries: 10
> >>>/hpc_shared/motorsports/gmics/Raven/p3/70_rke/Movies
> >>>
> >>>
> >>>
> >>>
> >>>
> >>>
> >>>
> >>>/hpc_shared/motorsports/gmics/Raven/p3/70_rke/.Convrg.swp
> >>>/hpc_shared/motorsports/gmics/Raven/p3/70_rke
> >>>Brick gfsib01b.corvidtec.com:/data/brick01b/homegfs
> >>>Number of entries: 2
> >>>
> >>>/hpc_shared/motorsports/gmics/Raven/p3/70_rke
> >>>Brick gfsib01a.corvidtec.com:/data/brick02a/homegfs
> >>>Number of entries: 7
> >>>
> >>>/hpc_shared/motorsports/gmics/Raven/p3/70_rke/PICTURES/.tmpcheck
> >>>/hpc_shared/motorsports/gmics/Raven/p3/70_rke/PICTURES
> >>>/hpc_shared/motorsports/gmics/Raven/p3/70_rke/Movies
> >>>
> >>>
> >>>
> >>>Brick gfsib01b.corvidtec.com:/data/brick02b/homegfs
> >>>Number of entries: 0
> >>>Brick gfsib02a.corvidtec.com:/data/brick01a/homegfs
> >>>Number of entries: 0
> >>>Brick gfsib02b.corvidtec.com:/data/brick01b/homegfs
> >>>Number of entries: 0
> >>>Brick gfsib02a.corvidtec.com:/data/brick02a/homegfs
> >>>Number of entries: 0
> >>>Brick gfsib02b.corvidtec.com:/data/brick02b/homegfs
> >>>Number of entries: 0
> >>>===
> >>>David F. Robinson, Ph.D.
> >>>President - Corvid Technologies
> >>>704.799.6944 x101 [office]
> >>>704.252.1310 [cell]
> >>>704.799.7974 [fax]
> >>>david.robin...@corvidtec.com 
> >>>http://www.corvidtechnologies.com 
> >>>
> >>>
> >>>___
> >>>Gluster-devel mailing list
> >>>Gluster-devel@gluster.org
> >>>http://www.gluster.org/mailman/listinfo/gluster-devel
> >>
> 

> ___
> Gluster-devel mailing list
> Gluster-devel@gluster.org
> http://www.gluster.org/mailman/listinfo/gluster-devel



pgpssRXZEETwm.pgp
Description: PGP signature
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel


Re: [Gluster-devel] [Gluster-users] missing files

2015-02-05 Thread David F. Robinson
I'll send you the emails I sent Pranith with the logs. What causes these 
disconnects?

David  (Sent from mobile)

===
David F. Robinson, Ph.D. 
President - Corvid Technologies
704.799.6944 x101 [office]
704.252.1310  [cell]
704.799.7974  [fax]
david.robin...@corvidtec.com
http://www.corvidtechnologies.com

> On Feb 5, 2015, at 4:55 PM, Ben Turner  wrote:
> 
> - Original Message -
>> From: "Pranith Kumar Karampuri" 
>> To: "Xavier Hernandez" , "David F. Robinson" 
>> , "Benjamin Turner"
>> 
>> Cc: gluster-us...@gluster.org, "Gluster Devel" 
>> Sent: Thursday, February 5, 2015 5:30:04 AM
>> Subject: Re: [Gluster-users] [Gluster-devel] missing files
>> 
>> 
>>> On 02/05/2015 03:48 PM, Pranith Kumar Karampuri wrote:
>>> I believe David already fixed this. I hope this is the same issue he
>>> told about permissions issue.
>> Oops, it is not. I will take a look.
> 
> Yes David exactly like these:
> 
> data-brick02a-homegfs.log:[2015-02-03 19:09:34.568842] I 
> [server.c:518:server_rpc_notify] 0-homegfs-server: disconnecting connection 
> from 
> gfs02a.corvidtec.com-18563-2015/02/03-19:07:58:519134-homegfs-client-2-0-0
> data-brick02a-homegfs.log:[2015-02-03 19:09:41.286551] I 
> [server.c:518:server_rpc_notify] 0-homegfs-server: disconnecting connection 
> from 
> gfs01a.corvidtec.com-12804-2015/02/03-19:09:38:497808-homegfs-client-2-0-0
> data-brick02a-homegfs.log:[2015-02-03 19:16:35.906412] I 
> [server.c:518:server_rpc_notify] 0-homegfs-server: disconnecting connection 
> from 
> gfs02b.corvidtec.com-27190-2015/02/03-19:15:53:458467-homegfs-client-2-0-0
> data-brick02a-homegfs.log:[2015-02-03 19:51:22.761293] I 
> [server.c:518:server_rpc_notify] 0-homegfs-server: disconnecting connection 
> from gfs01a.corvidtec.com-25926-2015/02/03-19:51:02:89070-homegfs-client-2-0-0
> data-brick02a-homegfs.log:[2015-02-03 20:54:02.772180] I 
> [server.c:518:server_rpc_notify] 0-homegfs-server: disconnecting connection 
> from gfs01b.corvidtec.com-4175-2015/02/02-16:44:31:179119-homegfs-client-2-0-1
> 
> You can 100% verify my theory if you can correlate the time on the 
> disconnects to the time that the missing files were healed.  Can you have a 
> look at /var/log/glusterfs/glustershd.log?  That has all of the healed files 
> + timestamps, if we can see a disconnect during the rsync and a self heal of 
> the missing file I think we can safely assume that the disconnects may have 
> caused this.  I'll try this on my test systems, how much data did you rsync?  
> What size ish of files / an idea of the dir layout?  
> 
> @Pranith - Could bricks flapping up and down during the rsync cause the files 
> to be missing on the first ls(written to 1 subvol but not the other cause it 
> was down), the ls triggered SH, and thats why the files were there for the 
> second ls be a possible cause here?
> 
> -b
> 
> 
>> Pranith
>>> 
>>> Pranith
 On 02/05/2015 03:44 PM, Xavier Hernandez wrote:
 Is the failure repeatable ? with the same directories ?
 
 It's very weird that the directories appear on the volume when you do
 an 'ls' on the bricks. Could it be that you only made a single 'ls'
 on fuse mount which not showed the directory ? Is it possible that
 this 'ls' triggered a self-heal that repaired the problem, whatever
 it was, and when you did another 'ls' on the fuse mount after the
 'ls' on the bricks, the directories were there ?
 
 The first 'ls' could have healed the files, causing that the
 following 'ls' on the bricks showed the files as if nothing were
 damaged. If that's the case, it's possible that there were some
 disconnections during the copy.
 
 Added Pranith because he knows better replication and self-heal details.
 
 Xavi
 
> On 02/04/2015 07:23 PM, David F. Robinson wrote:
> Distributed/replicated
> 
> Volume Name: homegfs
> Type: Distributed-Replicate
> Volume ID: 1e32672a-f1b7-4b58-ba94-58c085e59071
> Status: Started
> Number of Bricks: 4 x 2 = 8
> Transport-type: tcp
> Bricks:
> Brick1: gfsib01a.corvidtec.com:/data/brick01a/homegfs
> Brick2: gfsib01b.corvidtec.com:/data/brick01b/homegfs
> Brick3: gfsib01a.corvidtec.com:/data/brick02a/homegfs
> Brick4: gfsib01b.corvidtec.com:/data/brick02b/homegfs
> Brick5: gfsib02a.corvidtec.com:/data/brick01a/homegfs
> Brick6: gfsib02b.corvidtec.com:/data/brick01b/homegfs
> Brick7: gfsib02a.corvidtec.com:/data/brick02a/homegfs
> Brick8: gfsib02b.corvidtec.com:/data/brick02b/homegfs
> Options Reconfigured:
> performance.io-thread-count: 32
> performance.cache-size: 128MB
> performance.write-behind-window-size: 128MB
> server.allow-insecure: on
> network.ping-timeout: 10
> storage.owner-gid: 100
> geo-replication.indexing: off
> geo-replication.ignore-pid-check: on
> changelog.changelog: on
> changelog.fsync-interval: 3
> changelog.rollover

Re: [Gluster-devel] [Gluster-users] missing files

2015-02-05 Thread David F. Robinson
It was a mix of files from very small to very large. And many terabytes of 
data. Approx 20tb

David  (Sent from mobile)

===
David F. Robinson, Ph.D. 
President - Corvid Technologies
704.799.6944 x101 [office]
704.252.1310  [cell]
704.799.7974  [fax]
david.robin...@corvidtec.com
http://www.corvidtechnologies.com

> On Feb 5, 2015, at 4:55 PM, Ben Turner  wrote:
> 
> - Original Message -
>> From: "Pranith Kumar Karampuri" 
>> To: "Xavier Hernandez" , "David F. Robinson" 
>> , "Benjamin Turner"
>> 
>> Cc: gluster-us...@gluster.org, "Gluster Devel" 
>> Sent: Thursday, February 5, 2015 5:30:04 AM
>> Subject: Re: [Gluster-users] [Gluster-devel] missing files
>> 
>> 
>>> On 02/05/2015 03:48 PM, Pranith Kumar Karampuri wrote:
>>> I believe David already fixed this. I hope this is the same issue he
>>> told about permissions issue.
>> Oops, it is not. I will take a look.
> 
> Yes David exactly like these:
> 
> data-brick02a-homegfs.log:[2015-02-03 19:09:34.568842] I 
> [server.c:518:server_rpc_notify] 0-homegfs-server: disconnecting connection 
> from 
> gfs02a.corvidtec.com-18563-2015/02/03-19:07:58:519134-homegfs-client-2-0-0
> data-brick02a-homegfs.log:[2015-02-03 19:09:41.286551] I 
> [server.c:518:server_rpc_notify] 0-homegfs-server: disconnecting connection 
> from 
> gfs01a.corvidtec.com-12804-2015/02/03-19:09:38:497808-homegfs-client-2-0-0
> data-brick02a-homegfs.log:[2015-02-03 19:16:35.906412] I 
> [server.c:518:server_rpc_notify] 0-homegfs-server: disconnecting connection 
> from 
> gfs02b.corvidtec.com-27190-2015/02/03-19:15:53:458467-homegfs-client-2-0-0
> data-brick02a-homegfs.log:[2015-02-03 19:51:22.761293] I 
> [server.c:518:server_rpc_notify] 0-homegfs-server: disconnecting connection 
> from gfs01a.corvidtec.com-25926-2015/02/03-19:51:02:89070-homegfs-client-2-0-0
> data-brick02a-homegfs.log:[2015-02-03 20:54:02.772180] I 
> [server.c:518:server_rpc_notify] 0-homegfs-server: disconnecting connection 
> from gfs01b.corvidtec.com-4175-2015/02/02-16:44:31:179119-homegfs-client-2-0-1
> 
> You can 100% verify my theory if you can correlate the time on the 
> disconnects to the time that the missing files were healed.  Can you have a 
> look at /var/log/glusterfs/glustershd.log?  That has all of the healed files 
> + timestamps, if we can see a disconnect during the rsync and a self heal of 
> the missing file I think we can safely assume that the disconnects may have 
> caused this.  I'll try this on my test systems, how much data did you rsync?  
> What size ish of files / an idea of the dir layout?  
> 
> @Pranith - Could bricks flapping up and down during the rsync cause the files 
> to be missing on the first ls(written to 1 subvol but not the other cause it 
> was down), the ls triggered SH, and thats why the files were there for the 
> second ls be a possible cause here?
> 
> -b
> 
> 
>> Pranith
>>> 
>>> Pranith
 On 02/05/2015 03:44 PM, Xavier Hernandez wrote:
 Is the failure repeatable ? with the same directories ?
 
 It's very weird that the directories appear on the volume when you do
 an 'ls' on the bricks. Could it be that you only made a single 'ls'
 on fuse mount which not showed the directory ? Is it possible that
 this 'ls' triggered a self-heal that repaired the problem, whatever
 it was, and when you did another 'ls' on the fuse mount after the
 'ls' on the bricks, the directories were there ?
 
 The first 'ls' could have healed the files, causing that the
 following 'ls' on the bricks showed the files as if nothing were
 damaged. If that's the case, it's possible that there were some
 disconnections during the copy.
 
 Added Pranith because he knows better replication and self-heal details.
 
 Xavi
 
> On 02/04/2015 07:23 PM, David F. Robinson wrote:
> Distributed/replicated
> 
> Volume Name: homegfs
> Type: Distributed-Replicate
> Volume ID: 1e32672a-f1b7-4b58-ba94-58c085e59071
> Status: Started
> Number of Bricks: 4 x 2 = 8
> Transport-type: tcp
> Bricks:
> Brick1: gfsib01a.corvidtec.com:/data/brick01a/homegfs
> Brick2: gfsib01b.corvidtec.com:/data/brick01b/homegfs
> Brick3: gfsib01a.corvidtec.com:/data/brick02a/homegfs
> Brick4: gfsib01b.corvidtec.com:/data/brick02b/homegfs
> Brick5: gfsib02a.corvidtec.com:/data/brick01a/homegfs
> Brick6: gfsib02b.corvidtec.com:/data/brick01b/homegfs
> Brick7: gfsib02a.corvidtec.com:/data/brick02a/homegfs
> Brick8: gfsib02b.corvidtec.com:/data/brick02b/homegfs
> Options Reconfigured:
> performance.io-thread-count: 32
> performance.cache-size: 128MB
> performance.write-behind-window-size: 128MB
> server.allow-insecure: on
> network.ping-timeout: 10
> storage.owner-gid: 100
> geo-replication.indexing: off
> geo-replication.ignore-pid-check: on
> changelog.changelog: on
> changelog.fsync-interval: 3
> changelog.r

Re: [Gluster-devel] [Gluster-users] missing files

2015-02-05 Thread David F. Robinson
Should I run my rsync with --block-size set to something other than the default?
Is there an optimal value? I think 128k is the max from my quick search; I
didn't dig into it thoroughly though.

David  (Sent from mobile)

===
David F. Robinson, Ph.D. 
President - Corvid Technologies
704.799.6944 x101 [office]
704.252.1310  [cell]
704.799.7974  [fax]
david.robin...@corvidtec.com
http://www.corvidtechnologies.com

> On Feb 5, 2015, at 5:41 PM, Ben Turner  wrote:
> 
> - Original Message -
>> From: "Ben Turner" 
>> To: "David F. Robinson" 
>> Cc: "Pranith Kumar Karampuri" , "Xavier Hernandez" 
>> , "Benjamin Turner"
>> , gluster-us...@gluster.org, "Gluster Devel" 
>> 
>> Sent: Thursday, February 5, 2015 5:22:26 PM
>> Subject: Re: [Gluster-users] [Gluster-devel] missing files
>> 
>> - Original Message -
>>> From: "David F. Robinson" 
>>> To: "Ben Turner" 
>>> Cc: "Pranith Kumar Karampuri" , "Xavier Hernandez"
>>> , "Benjamin Turner"
>>> , gluster-us...@gluster.org, "Gluster Devel"
>>> 
>>> Sent: Thursday, February 5, 2015 5:01:13 PM
>>> Subject: Re: [Gluster-users] [Gluster-devel] missing files
>>> 
>>> I'll send you the emails I sent Pranith with the logs. What causes these
>>> disconnects?
>> 
>> Thanks David!  Disconnects happen when there is an interruption in
>> communication between peers; normally it is a ping timeout that happens.
>> It could be anything from a flaky NW to the system being too busy to respond
>> to the pings.  My initial take is more towards the latter, as rsync is
>> absolutely the worst use case for gluster - IIRC it writes in 4kb blocks.  I
>> try to keep my writes at least 64KB, as in my testing that is the smallest
>> block size I can write with before perf starts to really drop off.  I'll try
>> something similar in the lab.
> 
> Ok I do think that the file being self healed is RCA for what you were 
> seeing.  Lets look at one of the disconnects:
> 
> data-brick02a-homegfs.log:[2015-02-03 20:54:02.772180] I 
> [server.c:518:server_rpc_notify] 0-homegfs-server: disconnecting connection 
> from gfs01b.corvidtec.com-4175-2015/02/02-16:44:31:179119-homegfs-client-2-0-1
> 
> And in the glustershd.log from the gfs01b_glustershd.log file:
> 
> [2015-02-03 20:55:48.001797] I 
> [afr-self-heal-entry.c:554:afr_selfheal_entry_do] 0-homegfs-replicate-0: 
> performing entry selfheal on 6c79a368-edaa-432b-bef9-ec690ab42448
> [2015-02-03 20:55:49.341996] I [afr-self-heal-common.c:476:afr_log_selfheal] 
> 0-homegfs-replicate-0: Completed entry selfheal on 
> 6c79a368-edaa-432b-bef9-ec690ab42448. source=1 sinks=0 
> [2015-02-03 20:55:49.343093] I 
> [afr-self-heal-entry.c:554:afr_selfheal_entry_do] 0-homegfs-replicate-0: 
> performing entry selfheal on 792cb0d6-9290-4447-8cd7-2b2d7a116a69
> [2015-02-03 20:55:50.463652] I [afr-self-heal-common.c:476:afr_log_selfheal] 
> 0-homegfs-replicate-0: Completed entry selfheal on 
> 792cb0d6-9290-4447-8cd7-2b2d7a116a69. source=1 sinks=0 
> [2015-02-03 20:55:51.465289] I 
> [afr-self-heal-metadata.c:54:__afr_selfheal_metadata_do] 
> 0-homegfs-replicate-0: performing metadata selfheal on 
> 403e661a-1c27-4e79-9867-c0572aba2b3c
> [2015-02-03 20:55:51.466515] I [afr-self-heal-common.c:476:afr_log_selfheal] 
> 0-homegfs-replicate-0: Completed metadata selfheal on 
> 403e661a-1c27-4e79-9867-c0572aba2b3c. source=1 sinks=0 
> [2015-02-03 20:55:51.467098] I 
> [afr-self-heal-entry.c:554:afr_selfheal_entry_do] 0-homegfs-replicate-0: 
> performing entry selfheal on 403e661a-1c27-4e79-9867-c0572aba2b3c
> [2015-02-03 20:55:55.257808] I [afr-self-heal-common.c:476:afr_log_selfheal] 
> 0-homegfs-replicate-0: Completed entry selfheal on 
> 403e661a-1c27-4e79-9867-c0572aba2b3c. source=1 sinks=0 
> [2015-02-03 20:55:55.258548] I 
> [afr-self-heal-metadata.c:54:__afr_selfheal_metadata_do] 
> 0-homegfs-replicate-0: performing metadata selfheal on 
> c612ee2f-2fb4-4157-a9ab-5a2d5603c541
> [2015-02-03 20:55:55.259367] I [afr-self-heal-common.c:476:afr_log_selfheal] 
> 0-homegfs-replicate-0: Completed metadata selfheal on 
> c612ee2f-2fb4-4157-a9ab-5a2d5603c541. source=1 sinks=0 
> [2015-02-03 20:55:55.259980] I 
> [afr-self-heal-entry.c:554:afr_selfheal_entry_do] 0-homegfs-replicate-0: 
> performing entry selfheal on c612ee2f-2fb4-4157-a9ab-5a2d5603c541
> 
> As you can see the self heal logs are just spammed with files being healed, 
> and I looked at a couple of disconnects and I see self heals getting run 
> shortly after on the bricks that were down.  Now we need to find the cause of 
> the disconnects, I am thinking once the disconnects are resolved the files 
> should be properly copied over without SH having to fix things.  Like I said 
> I'll give this a go on my lab systems and see if I can repro the disconnects, 
> I'll have time to run through it tomorrow.  If in the mean time anyone else 
> has a theory / anything to add here it would be appreciated.
> 
> -b
> 
>> -b
>> 
>>> David  (Sent from mobile)
>>> 
>>> ==

Re: [Gluster-devel] [Gluster-users] missing files

2015-02-05 Thread Joe Julian

Out of curiosity, are you using --inplace?
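
i.e. something along these lines (flags and paths purely illustrative, not a
tuning recommendation):

    rsync -aHAX --inplace --block-size=131072 /source/dir/ /mnt/homegfs/dir/

Without --inplace, rsync writes each file to a temporary dot-file and then
renames it into place; --inplace updates the destination file directly, which
avoids that extra rename on the volume.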

On 02/05/2015 02:59 PM, David F. Robinson wrote:

Should I run my rsync with --block-size = something other than the default? Is 
there an optimal value? I think 128k is the max from my quick search. Didn't 
dig into it throughly though.

David  (Sent from mobile)

===
David F. Robinson, Ph.D.
President - Corvid Technologies
704.799.6944 x101 [office]
704.252.1310  [cell]
704.799.7974  [fax]
david.robin...@corvidtec.com
http://www.corvidtechnologies.com


On Feb 5, 2015, at 5:41 PM, Ben Turner  wrote:

- Original Message -

From: "Ben Turner" 
To: "David F. Robinson" 
Cc: "Pranith Kumar Karampuri" , "Xavier Hernandez" 
, "Benjamin Turner"
, gluster-us...@gluster.org, "Gluster Devel" 

Sent: Thursday, February 5, 2015 5:22:26 PM
Subject: Re: [Gluster-users] [Gluster-devel] missing files

- Original Message -

From: "David F. Robinson" 
To: "Ben Turner" 
Cc: "Pranith Kumar Karampuri" , "Xavier Hernandez"
, "Benjamin Turner"
, gluster-us...@gluster.org, "Gluster Devel"

Sent: Thursday, February 5, 2015 5:01:13 PM
Subject: Re: [Gluster-users] [Gluster-devel] missing files

I'll send you the emails I sent Pranith with the logs. What causes these
disconnects?

Thanks David!  Disconnects happen when there is an interruption in
communication between peers; normally it is a ping timeout that happens.
It could be anything from a flaky NW to the system being too busy to respond
to the pings.  My initial take is more towards the latter, as rsync is
absolutely the worst use case for gluster - IIRC it writes in 4kb blocks.  I
try to keep my writes at least 64KB, as in my testing that is the smallest
block size I can write with before perf starts to really drop off.  I'll try
something similar in the lab.
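
A quick way to see that effect from a fuse mount (sizes and paths
illustrative) is to compare something like:

    dd if=/dev/zero of=/mnt/homegfs/ddtest bs=4k  count=65536 conv=fsync
    dd if=/dev/zero of=/mnt/homegfs/ddtest bs=64k count=4096  conv=fsync

and look at the throughput each run reports.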

Ok, I do think that the files being self-healed is the RCA for what you were
seeing. Let's look at one of the disconnects:

data-brick02a-homegfs.log:[2015-02-03 20:54:02.772180] I 
[server.c:518:server_rpc_notify] 0-homegfs-server: disconnecting connection 
from gfs01b.corvidtec.com-4175-2015/02/02-16:44:31:179119-homegfs-client-2-0-1

And in the glustershd.log from the gfs01b_glustershd.log file:

[2015-02-03 20:55:48.001797] I 
[afr-self-heal-entry.c:554:afr_selfheal_entry_do] 0-homegfs-replicate-0: 
performing entry selfheal on 6c79a368-edaa-432b-bef9-ec690ab42448
[2015-02-03 20:55:49.341996] I [afr-self-heal-common.c:476:afr_log_selfheal] 
0-homegfs-replicate-0: Completed entry selfheal on 
6c79a368-edaa-432b-bef9-ec690ab42448. source=1 sinks=0
[2015-02-03 20:55:49.343093] I 
[afr-self-heal-entry.c:554:afr_selfheal_entry_do] 0-homegfs-replicate-0: 
performing entry selfheal on 792cb0d6-9290-4447-8cd7-2b2d7a116a69
[2015-02-03 20:55:50.463652] I [afr-self-heal-common.c:476:afr_log_selfheal] 
0-homegfs-replicate-0: Completed entry selfheal on 
792cb0d6-9290-4447-8cd7-2b2d7a116a69. source=1 sinks=0
[2015-02-03 20:55:51.465289] I 
[afr-self-heal-metadata.c:54:__afr_selfheal_metadata_do] 0-homegfs-replicate-0: 
performing metadata selfheal on 403e661a-1c27-4e79-9867-c0572aba2b3c
[2015-02-03 20:55:51.466515] I [afr-self-heal-common.c:476:afr_log_selfheal] 
0-homegfs-replicate-0: Completed metadata selfheal on 
403e661a-1c27-4e79-9867-c0572aba2b3c. source=1 sinks=0
[2015-02-03 20:55:51.467098] I 
[afr-self-heal-entry.c:554:afr_selfheal_entry_do] 0-homegfs-replicate-0: 
performing entry selfheal on 403e661a-1c27-4e79-9867-c0572aba2b3c
[2015-02-03 20:55:55.257808] I [afr-self-heal-common.c:476:afr_log_selfheal] 
0-homegfs-replicate-0: Completed entry selfheal on 
403e661a-1c27-4e79-9867-c0572aba2b3c. source=1 sinks=0
[2015-02-03 20:55:55.258548] I 
[afr-self-heal-metadata.c:54:__afr_selfheal_metadata_do] 0-homegfs-replicate-0: 
performing metadata selfheal on c612ee2f-2fb4-4157-a9ab-5a2d5603c541
[2015-02-03 20:55:55.259367] I [afr-self-heal-common.c:476:afr_log_selfheal] 
0-homegfs-replicate-0: Completed metadata selfheal on 
c612ee2f-2fb4-4157-a9ab-5a2d5603c541. source=1 sinks=0
[2015-02-03 20:55:55.259980] I 
[afr-self-heal-entry.c:554:afr_selfheal_entry_do] 0-homegfs-replicate-0: 
performing entry selfheal on c612ee2f-2fb4-4157-a9ab-5a2d5603c541

As you can see the self heal logs are just spammed with files being healed, and 
I looked at a couple of disconnects and I see self heals getting run shortly 
after on the bricks that were down.  Now we need to find the cause of the 
disconnects, I am thinking once the disconnects are resolved the files should 
be properly copied over without SH having to fix things.  Like I said I'll give 
this a go on my lab systems and see if I can repro the disconnects, I'll have 
time to run through it tomorrow.  If in the mean time anyone else has a theory 
/ anything to add here it would be appreciated.

-b


-b


David  (Sent from mobile)

===
David F. Robinson, Ph.D.
President - Corvid Technologies
704.799.6944 x101 [office]
704.252.1310  [cell]
704.799.7974  [fax]
d

Re: [Gluster-devel] [Gluster-users] missing files

2015-02-05 Thread David F. Robinson
Isn't rsync what geo-rep uses?

David  (Sent from mobile)

===
David F. Robinson, Ph.D. 
President - Corvid Technologies
704.799.6944 x101 [office]
704.252.1310  [cell]
704.799.7974  [fax]
david.robin...@corvidtec.com
http://www.corvidtechnologies.com

> On Feb 5, 2015, at 5:41 PM, Ben Turner  wrote:
> 
> - Original Message -
>> From: "Ben Turner" 
>> To: "David F. Robinson" 
>> Cc: "Pranith Kumar Karampuri" , "Xavier Hernandez" 
>> , "Benjamin Turner"
>> , gluster-us...@gluster.org, "Gluster Devel" 
>> 
>> Sent: Thursday, February 5, 2015 5:22:26 PM
>> Subject: Re: [Gluster-users] [Gluster-devel] missing files
>> 
>> - Original Message -
>>> From: "David F. Robinson" 
>>> To: "Ben Turner" 
>>> Cc: "Pranith Kumar Karampuri" , "Xavier Hernandez"
>>> , "Benjamin Turner"
>>> , gluster-us...@gluster.org, "Gluster Devel"
>>> 
>>> Sent: Thursday, February 5, 2015 5:01:13 PM
>>> Subject: Re: [Gluster-users] [Gluster-devel] missing files
>>> 
>>> I'll send you the emails I sent Pranith with the logs. What causes these
>>> disconnects?
>> 
>> Thanks David!  Disconnects happen when there is an interruption in
>> communication between peers; normally it is a ping timeout that happens.
>> It could be anything from a flaky NW to the system being too busy to respond
>> to the pings.  My initial take is more towards the latter, as rsync is
>> absolutely the worst use case for gluster - IIRC it writes in 4kb blocks.  I
>> try to keep my writes at least 64KB, as in my testing that is the smallest
>> block size I can write with before perf starts to really drop off.  I'll try
>> something similar in the lab.
> 
> Ok I do think that the file being self healed is RCA for what you were 
> seeing.  Lets look at one of the disconnects:
> 
> data-brick02a-homegfs.log:[2015-02-03 20:54:02.772180] I 
> [server.c:518:server_rpc_notify] 0-homegfs-server: disconnecting connection 
> from gfs01b.corvidtec.com-4175-2015/02/02-16:44:31:179119-homegfs-client-2-0-1
> 
> And in the glustershd.log from the gfs01b_glustershd.log file:
> 
> [2015-02-03 20:55:48.001797] I 
> [afr-self-heal-entry.c:554:afr_selfheal_entry_do] 0-homegfs-replicate-0: 
> performing entry selfheal on 6c79a368-edaa-432b-bef9-ec690ab42448
> [2015-02-03 20:55:49.341996] I [afr-self-heal-common.c:476:afr_log_selfheal] 
> 0-homegfs-replicate-0: Completed entry selfheal on 
> 6c79a368-edaa-432b-bef9-ec690ab42448. source=1 sinks=0 
> [2015-02-03 20:55:49.343093] I 
> [afr-self-heal-entry.c:554:afr_selfheal_entry_do] 0-homegfs-replicate-0: 
> performing entry selfheal on 792cb0d6-9290-4447-8cd7-2b2d7a116a69
> [2015-02-03 20:55:50.463652] I [afr-self-heal-common.c:476:afr_log_selfheal] 
> 0-homegfs-replicate-0: Completed entry selfheal on 
> 792cb0d6-9290-4447-8cd7-2b2d7a116a69. source=1 sinks=0 
> [2015-02-03 20:55:51.465289] I 
> [afr-self-heal-metadata.c:54:__afr_selfheal_metadata_do] 
> 0-homegfs-replicate-0: performing metadata selfheal on 
> 403e661a-1c27-4e79-9867-c0572aba2b3c
> [2015-02-03 20:55:51.466515] I [afr-self-heal-common.c:476:afr_log_selfheal] 
> 0-homegfs-replicate-0: Completed metadata selfheal on 
> 403e661a-1c27-4e79-9867-c0572aba2b3c. source=1 sinks=0 
> [2015-02-03 20:55:51.467098] I 
> [afr-self-heal-entry.c:554:afr_selfheal_entry_do] 0-homegfs-replicate-0: 
> performing entry selfheal on 403e661a-1c27-4e79-9867-c0572aba2b3c
> [2015-02-03 20:55:55.257808] I [afr-self-heal-common.c:476:afr_log_selfheal] 
> 0-homegfs-replicate-0: Completed entry selfheal on 
> 403e661a-1c27-4e79-9867-c0572aba2b3c. source=1 sinks=0 
> [2015-02-03 20:55:55.258548] I 
> [afr-self-heal-metadata.c:54:__afr_selfheal_metadata_do] 
> 0-homegfs-replicate-0: performing metadata selfheal on 
> c612ee2f-2fb4-4157-a9ab-5a2d5603c541
> [2015-02-03 20:55:55.259367] I [afr-self-heal-common.c:476:afr_log_selfheal] 
> 0-homegfs-replicate-0: Completed metadata selfheal on 
> c612ee2f-2fb4-4157-a9ab-5a2d5603c541. source=1 sinks=0 
> [2015-02-03 20:55:55.259980] I 
> [afr-self-heal-entry.c:554:afr_selfheal_entry_do] 0-homegfs-replicate-0: 
> performing entry selfheal on c612ee2f-2fb4-4157-a9ab-5a2d5603c541
> 
> As you can see the self heal logs are just spammed with files being healed, 
> and I looked at a couple of disconnects and I see self heals getting run 
> shortly after on the bricks that were down.  Now we need to find the cause of 
> the disconnects, I am thinking once the disconnects are resolved the files 
> should be properly copied over without SH having to fix things.  Like I said 
> I'll give this a go on my lab systems and see if I can repro the disconnects, 
> I'll have time to run through it tomorrow.  If in the mean time anyone else 
> has a theory / anything to add here it would be appreciated.
> 
> -b
> 
>> -b
>> 
>>> David  (Sent from mobile)
>>> 
>>> ===
>>> David F. Robinson, Ph.D.
>>> President - Corvid Technologies
>>> 704.799.6944 x101 [office]
>>> 704.252.1310  [cell]
>>> 704.799.7974   

Re: [Gluster-devel] Gluster 3.6.2 On Xeon Phi

2015-02-05 Thread Rudra Siva
Rafi,

Sorry it took me some time - I had to merge these with some of my
changes. The scif0 (iWARP) device does not support SRQ (max_srq: 0), so I have
changed some of the code to use QPs instead - I can provide those changes if
there is interest once this is stable.

Here's the good:

The performance with the patches is better than without (especially
http://review.gluster.org/#/c/9327/).

The bad: glusterfsd crashes for large files, so it's difficult to get
decent benchmark numbers - small ones look good. I'm still trying to
understand the patch at this time. It looks like this code comes from
9327 as well.

Can you please review the reset of mr_count?

Info from gdb is as follows - if you need more or something jumps out
please feel free to let me know.

(gdb) p *post
$16 = {next = 0x7fffe003b280, prev = 0x7fffe0037cc0, mr =
0x7fffe0037fb0, buf = 0x7fffe0096000 "\005\004", buf_size = 4096, aux
= 0 '\000',
  reused = 1, device = 0x7fffe00019c0, type = GF_RDMA_RECV_POST, ctx =
{mr = {0x7fffe0003020, 0x7fffc8005f20, 0x7fffc8000aa0, 0x7fffc80030c0,
  0x7fffc8002d70, 0x7fffc8008bb0, 0x7fffc8008bf0, 0x7fffc8002cd0},
mr_count = -939493456, vector = {{iov_base = 0x77fd6000,
iov_len = 112}, {iov_base = 0x7fffbf14, iov_len = 131072},
{iov_base = 0x0, iov_len = 0} }, count = 2,
iobref = 0x7fffc8001670, hdr_iobuf = 0x61d710, is_request = 0
'\000', gf_rdma_reads = 1, reply_info = 0x0}, refcount = 1, lock = {
__data = {__lock = 0, __count = 0, __owner = 0, __nusers = 0,
__kind = 0, __spins = 0, __list = {__prev = 0x0, __next = 0x0}},
__size = '\000' , __align = 0}}

(gdb) bt
#0  0x7fffe7142681 in __gf_rdma_register_local_mr_for_rdma
(peer=0x7fffe0001800, vector=0x7fffe003b108, count=1,
ctx=0x7fffe003b0b0)
at rdma.c:2255
#1  0x7fffe7145acd in gf_rdma_do_reads (peer=0x7fffe0001800,
post=0x7fffe003b070, readch=0x7fffe0096010) at rdma.c:3609
#2  0x7fffe714656e in gf_rdma_recv_request (peer=0x7fffe0001800,
post=0x7fffe003b070, readch=0x7fffe0096010) at rdma.c:3859
#3  0x7fffe714691d in gf_rdma_process_recv (peer=0x7fffe0001800,
wc=0x7fffceffcd20) at rdma.c:3967
#4  0x7fffe7146e7d in gf_rdma_recv_completion_proc
(data=0x7fffe0002b30) at rdma.c:4114
#5  0x772cfdf3 in start_thread () from /lib64/libpthread.so.0
#6  0x76c403dd in clone () from /lib64/libc.so.6

On Fri, Jan 30, 2015 at 7:11 AM, Mohammed Rafi K C  wrote:
>
> On 01/29/2015 06:13 PM, Rudra Siva wrote:
>> Hi,
>>
>> Have been able to get Gluster running on Intel's MIC platform. The
>> only code change to Gluster source was an unresolved yylex (I am not
>> really sure why that was coming up - may be someone more familiar with
>> it's use in Gluster can answer).
>>
>> At the step for compiling the binaries (glusterd, glusterfsd,
>> glusterfs, glfsheal)  build breaks with an unresolved yylex error.
>>
>> For now have a routine yylex that simply calls graphyylex - I don't
>> know if this is even correct however mount functions.
>>
>> GCC - 4.7 (it's an oddity, latest GCC is missing the Phi patches)
>>
>> flex --version
>> flex 2.5.39
>>
>> bison --version
>> bison (GNU Bison) 3.0
>>
>> I'm still working on testing the RDMA and Infiniband support and can
>> make notes, numbers available when that is complete.
> There are couple of rdma performance related patches under review. If
> you could make use of those patches, I hope that will give a performance
> enhancement.
>
> [1] : http://review.gluster.org/#/c/9329/
> [2] : http://review.gluster.org/#/c/9321/
> [3] : http://review.gluster.org/#/c/9327/
> [4] : http://review.gluster.org/#/c/9506/
>
> Let me know if you need any clarification.
>
> Regards!
> Rafi KC
>>
>



-- 
-Siva
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel


Re: [Gluster-devel] failed heal

2015-02-05 Thread Pranith Kumar Karampuri


- Original Message -
> From: "Niels de Vos" 
> To: "Pranith Kumar Karampuri" 
> Cc: gluster-us...@gluster.org, "Gluster Devel" 
> Sent: Friday, February 6, 2015 2:32:36 AM
> Subject: Re: [Gluster-devel] failed heal
> 
> On Thu, Feb 05, 2015 at 11:21:58AM +0530, Pranith Kumar Karampuri wrote:
> > 
> > On 02/04/2015 11:52 PM, David F. Robinson wrote:
> > >I don't recall if that was before or after my upgrade.
> > >I'll forward you an email thread for the current heal issues which are
> > >after the 3.6.2 upgrade...
> > This is executed after the upgrade on just one machine. 3.6.2 entry locks
> > are not compatible with versions <= 3.5.3 and 3.6.1 that is the reason.
> > From
> > 3.5.4 and releases >=3.6.2 it should work fine.
> 
> Oh, I was not aware of this requirement. Does it mean we should not mix
> deployments with these versions (what about 3.4?) any longer? 3.5.4 has
> not been released yet, so anyone with a mixed 3.5/3.6.2 environment will
> hit these issues? Is this only for the self-heal daemon, or are the
> triggered/stat self-heal procedures affected too?
> 
> It should be noted *very* clearly in the release notes, and I think an
> announcement (email+blog) as a warning/reminder would be good. Could you
> get some details and advice written down, please?
Will do today.

Pranith
> 
> Thanks,
> Niels
> 
> 
> > 
> > Pranith
> > >David
> > >-- Original Message --
> > >From: "Pranith Kumar Karampuri"  > >>
> > >To: "David F. Robinson"  > >>; "gluster-us...@gluster.org"
> > >mailto:gluster-us...@gluster.org>>; "Gluster
> > >Devel" mailto:gluster-devel@gluster.org>>
> > >Sent: 2/4/2015 2:33:20 AM
> > >Subject: Re: [Gluster-devel] failed heal
> > >>
> > >>On 02/02/2015 03:34 AM, David F. Robinson wrote:
> > >>>I have several files that gluster says it cannot heal. I deleted the
> > >>>files from all of the bricks
> > >>>(/data/brick0*/hpc_shared/motorsports/gmics/Raven/p3/*) and ran a full
> > >>>heal using 'gluster volume heal homegfs full'.  Even after the full
> > >>>heal, the entries below still show up.
> > >>>How do I clear these?
> > >>3.6.1 Had an issue where files undergoing I/O will also be shown in the
> > >>output of 'gluster volume heal  info', we addressed that in
> > >>3.6.2. Is this output from 3.6.1 by any chance?
> > >>
> > >>Pranith
> > >>>[root@gfs01a ~]# gluster volume heal homegfs info
> > >>>Gathering list of entries to be healed on volume homegfs has been
> > >>>successful
> > >>>Brick gfsib01a.corvidtec.com:/data/brick01a/homegfs
> > >>>Number of entries: 10
> > >>>/hpc_shared/motorsports/gmics/Raven/p3/70_rke/Movies
> > >>>
> > >>>
> > >>>
> > >>>
> > >>>
> > >>>
> > >>>
> > >>>/hpc_shared/motorsports/gmics/Raven/p3/70_rke/.Convrg.swp
> > >>>/hpc_shared/motorsports/gmics/Raven/p3/70_rke
> > >>>Brick gfsib01b.corvidtec.com:/data/brick01b/homegfs
> > >>>Number of entries: 2
> > >>>
> > >>>/hpc_shared/motorsports/gmics/Raven/p3/70_rke
> > >>>Brick gfsib01a.corvidtec.com:/data/brick02a/homegfs
> > >>>Number of entries: 7
> > >>>
> > >>>/hpc_shared/motorsports/gmics/Raven/p3/70_rke/PICTURES/.tmpcheck
> > >>>/hpc_shared/motorsports/gmics/Raven/p3/70_rke/PICTURES
> > >>>/hpc_shared/motorsports/gmics/Raven/p3/70_rke/Movies
> > >>>
> > >>>
> > >>>
> > >>>Brick gfsib01b.corvidtec.com:/data/brick02b/homegfs
> > >>>Number of entries: 0
> > >>>Brick gfsib02a.corvidtec.com:/data/brick01a/homegfs
> > >>>Number of entries: 0
> > >>>Brick gfsib02b.corvidtec.com:/data/brick01b/homegfs
> > >>>Number of entries: 0
> > >>>Brick gfsib02a.corvidtec.com:/data/brick02a/homegfs
> > >>>Number of entries: 0
> > >>>Brick gfsib02b.corvidtec.com:/data/brick02b/homegfs
> > >>>Number of entries: 0
> > >>>===
> > >>>David F. Robinson, Ph.D.
> > >>>President - Corvid Technologies
> > >>>704.799.6944 x101 [office]
> > >>>704.252.1310 [cell]
> > >>>704.799.7974 [fax]
> > >>>david.robin...@corvidtec.com 
> > >>>http://www.corvidtechnologies.com 
> > >>>
> > >>>
> > >>>___
> > >>>Gluster-devel mailing list
> > >>>Gluster-devel@gluster.org
> > >>>http://www.gluster.org/mailman/listinfo/gluster-devel
> > >>
> > 
> 
> > ___
> > Gluster-devel mailing list
> > Gluster-devel@gluster.org
> > http://www.gluster.org/mailman/listinfo/gluster-devel
> 
> 
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel


Re: [Gluster-devel] [Gluster-users] missing files

2015-02-05 Thread Pranith Kumar Karampuri


- Original Message -
> From: "Ben Turner" 
> To: "Pranith Kumar Karampuri" , "David F. Robinson" 
> 
> Cc: "Xavier Hernandez" , "Benjamin Turner" 
> , gluster-us...@gluster.org,
> "Gluster Devel" 
> Sent: Friday, February 6, 2015 3:25:28 AM
> Subject: Re: [Gluster-users] [Gluster-devel] missing files
> 
> - Original Message -
> > From: "Pranith Kumar Karampuri" 
> > To: "Xavier Hernandez" , "David F. Robinson"
> > , "Benjamin Turner"
> > 
> > Cc: gluster-us...@gluster.org, "Gluster Devel" 
> > Sent: Thursday, February 5, 2015 5:30:04 AM
> > Subject: Re: [Gluster-users] [Gluster-devel] missing files
> > 
> > 
> > On 02/05/2015 03:48 PM, Pranith Kumar Karampuri wrote:
> > > I believe David already fixed this. I hope this is the same issue he
> > > told about permissions issue.
> > Oops, it is not. I will take a look.
> 
> Yes David exactly like these:
> 
> data-brick02a-homegfs.log:[2015-02-03 19:09:34.568842] I
> [server.c:518:server_rpc_notify] 0-homegfs-server: disconnecting connection
> from
> gfs02a.corvidtec.com-18563-2015/02/03-19:07:58:519134-homegfs-client-2-0-0
> data-brick02a-homegfs.log:[2015-02-03 19:09:41.286551] I
> [server.c:518:server_rpc_notify] 0-homegfs-server: disconnecting connection
> from
> gfs01a.corvidtec.com-12804-2015/02/03-19:09:38:497808-homegfs-client-2-0-0
> data-brick02a-homegfs.log:[2015-02-03 19:16:35.906412] I
> [server.c:518:server_rpc_notify] 0-homegfs-server: disconnecting connection
> from
> gfs02b.corvidtec.com-27190-2015/02/03-19:15:53:458467-homegfs-client-2-0-0
> data-brick02a-homegfs.log:[2015-02-03 19:51:22.761293] I
> [server.c:518:server_rpc_notify] 0-homegfs-server: disconnecting connection
> from
> gfs01a.corvidtec.com-25926-2015/02/03-19:51:02:89070-homegfs-client-2-0-0
> data-brick02a-homegfs.log:[2015-02-03 20:54:02.772180] I
> [server.c:518:server_rpc_notify] 0-homegfs-server: disconnecting connection
> from
> gfs01b.corvidtec.com-4175-2015/02/02-16:44:31:179119-homegfs-client-2-0-1
> 
> You can 100% verify my theory if you can correlate the times of the
> disconnects with the times that the missing files were healed.  Can you have a
> look at /var/log/glusterfs/glustershd.log?  That has all of the healed files
> + timestamps.  If we can see a disconnect during the rsync and a self heal of
> the missing file, I think we can safely assume that the disconnects may have
> caused this.  I'll try this on my test systems; how much data did you rsync?
> Roughly what size of files / an idea of the dir layout?
> 
> @Pranith - Could bricks flapping up and down during the rsync be a possible
> cause here: the files were missing on the first ls because they were written to
> one subvol but not the other (which was down), the ls triggered self-heal, and
> that's why the files were there for the second ls?

No, that would be a bug. AFR should serve the directory contents from the brick
that has those files.
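
For reference, a rough way to line up the two logs (only a sketch; it assumes
the default log locations and the brick log named in the excerpt above, so
adjust paths for your installation):

    # Timestamps of the disconnects seen by the brick
    grep "disconnecting connection" \
        /var/log/glusterfs/bricks/data-brick02a-homegfs.log | cut -d' ' -f1-2

    # Timestamps of completed self-heals
    grep "Completed .* selfheal" /var/log/glusterfs/glustershd.log | cut -d' ' -f1-2

    # Heals completing shortly after disconnects that happened during the rsync
    # would support the ping-timeout theory.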

> 
> -b
> 
>  
> > Pranith
> > >
> > > Pranith
> > > On 02/05/2015 03:44 PM, Xavier Hernandez wrote:
> > >> Is the failure repeatable ? with the same directories ?
> > >>
> > >> It's very weird that the directories appear on the volume when you do
> > >> an 'ls' on the bricks. Could it be that you only made a single 'ls'
> > >> on fuse mount which not showed the directory ? Is it possible that
> > >> this 'ls' triggered a self-heal that repaired the problem, whatever
> > >> it was, and when you did another 'ls' on the fuse mount after the
> > >> 'ls' on the bricks, the directories were there ?
> > >>
> > >> The first 'ls' could have healed the files, causing that the
> > >> following 'ls' on the bricks showed the files as if nothing were
> > >> damaged. If that's the case, it's possible that there were some
> > >> disconnections during the copy.
> > >>
> > >> Added Pranith because he knows better replication and self-heal details.
> > >>
> > >> Xavi
> > >>
> > >> On 02/04/2015 07:23 PM, David F. Robinson wrote:
> > >>> Distributed/replicated
> > >>>
> > >>> Volume Name: homegfs
> > >>> Type: Distributed-Replicate
> > >>> Volume ID: 1e32672a-f1b7-4b58-ba94-58c085e59071
> > >>> Status: Started
> > >>> Number of Bricks: 4 x 2 = 8
> > >>> Transport-type: tcp
> > >>> Bricks:
> > >>> Brick1: gfsib01a.corvidtec.com:/data/brick01a/homegfs
> > >>> Brick2: gfsib01b.corvidtec.com:/data/brick01b/homegfs
> > >>> Brick3: gfsib01a.corvidtec.com:/data/brick02a/homegfs
> > >>> Brick4: gfsib01b.corvidtec.com:/data/brick02b/homegfs
> > >>> Brick5: gfsib02a.corvidtec.com:/data/brick01a/homegfs
> > >>> Brick6: gfsib02b.corvidtec.com:/data/brick01b/homegfs
> > >>> Brick7: gfsib02a.corvidtec.com:/data/brick02a/homegfs
> > >>> Brick8: gfsib02b.corvidtec.com:/data/brick02b/homegfs
> > >>> Options Reconfigured:
> > >>> performance.io-thread-count: 32
> > >>> performance.cache-size: 128MB
> > >>> performance.write-behind-window-size: 128MB
> > >>> server.allow-insecure: on
> > >>> network.ping-timeout: 10
> > >>> storage.owner-gid: 100
> > >>> geo-replication.indexing: off
> > >>> geo-replication.ignore-pid-c

Re: [Gluster-devel] [Gluster-users] missing files

2015-02-05 Thread Benjamin Turner
Correct!  I have seen (back in the day; it's been about three years since I
last saw it) having, say, 50+ volumes, each with a geo-rep session, take system
load to the point where pings couldn't be serviced within the ping timeout.  So
it is known to happen, but there has been a lot of work in the geo-rep space to
help here, some of which is discussed here:

https://medium.com/@msvbhat/distributed-geo-replication-in-glusterfs-ec95f4393c50

(think tar + ssh and other fixes).  Your symptoms remind me of that case of
50+ geo-replicated volumes, which is why I mentioned it from the start.  My
current shoot-from-the-hip theory is that while rsyncing all that data the
servers got too busy to service the pings, and that led to disconnects.  This
is common across all of the clustering / distributed software I have worked
on: if the system gets too busy to service heartbeat within the timeout, things
go crazy (think fork bomb on a single host).  Now this could be a case of me
projecting symptoms from an old issue onto what you are describing, but that's
where my head is at.  If I'm correct I should be able to reproduce it using a
similar workload.  I think that the multi-threaded epoll changes that _just_
landed in master will help resolve this, but they are so new I haven't been
able to test them.  I'll know more when I get a chance to test tomorrow.

-b
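
If the disconnects do turn out to be load related, a couple of knobs are worth
experimenting with (a sketch, not a recommendation; the event-thread options
assume a build that already carries the multi-threaded epoll changes):

    # Give the servers more headroom before a missed ping becomes a disconnect
    # (the volume is currently at 10 seconds per the volume info above)
    gluster volume set homegfs network.ping-timeout 30

    # With multi-threaded epoll available, spread socket handling over more threads
    gluster volume set homegfs client.event-threads 4
    gluster volume set homegfs server.event-threads 4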

On Thu, Feb 5, 2015 at 6:04 PM, David F. Robinson <
david.robin...@corvidtec.com> wrote:

> Isn't rsync what geo-rep uses?
>
> David  (Sent from mobile)
>
> ===
> David F. Robinson, Ph.D.
> President - Corvid Technologies
> 704.799.6944 x101 [office]
> 704.252.1310  [cell]
> 704.799.7974  [fax]
> david.robin...@corvidtec.com
> http://www.corvidtechnologies.com
>
> > On Feb 5, 2015, at 5:41 PM, Ben Turner  wrote:
> >
> > - Original Message -
> >> From: "Ben Turner" 
> >> To: "David F. Robinson" 
> >> Cc: "Pranith Kumar Karampuri" , "Xavier
> Hernandez" , "Benjamin Turner"
> >> , gluster-us...@gluster.org, "Gluster Devel" <
> gluster-devel@gluster.org>
> >> Sent: Thursday, February 5, 2015 5:22:26 PM
> >> Subject: Re: [Gluster-users] [Gluster-devel] missing files
> >>
> >> - Original Message -
> >>> From: "David F. Robinson" 
> >>> To: "Ben Turner" 
> >>> Cc: "Pranith Kumar Karampuri" , "Xavier
> Hernandez"
> >>> , "Benjamin Turner"
> >>> , gluster-us...@gluster.org, "Gluster Devel"
> >>> 
> >>> Sent: Thursday, February 5, 2015 5:01:13 PM
> >>> Subject: Re: [Gluster-users] [Gluster-devel] missing files
> >>>
> >>> I'll send you the emails I sent Pranith with the logs. What causes these
> >>> disconnects?
> >>
> >> Thanks David!  Disconnects happen when there are interruptions in
> >> communication between peers; normally there is a ping timeout that happens.
> >> It could be anything from a flaky NW to the system being too busy to respond
> >> to the pings.  My initial take is more towards the latter, as rsync is
> >> absolutely the worst use case for gluster - IIRC it writes in 4kb blocks.  I
> >> try to keep my writes at least 64KB, as in my testing that is the smallest
> >> block size I can write with before perf starts to really drop off.  I'll try
> >> something similar in the lab.
> >
> > Ok, I do think that the file being self-healed is the RCA for what you were
> > seeing.  Let's look at one of the disconnects:
> >
> > data-brick02a-homegfs.log:[2015-02-03 20:54:02.772180] I
> [server.c:518:server_rpc_notify] 0-homegfs-server: disconnecting connection
> from
> gfs01b.corvidtec.com-4175-2015/02/02-16:44:31:179119-homegfs-client-2-0-1
> >
> > And in the glustershd.log from the gfs01b_glustershd.log file:
> >
> > [2015-02-03 20:55:48.001797] I
> [afr-self-heal-entry.c:554:afr_selfheal_entry_do] 0-homegfs-replicate-0:
> performing entry selfheal on 6c79a368-edaa-432b-bef9-ec690ab42448
> > [2015-02-03 20:55:49.341996] I
> [afr-self-heal-common.c:476:afr_log_selfheal] 0-homegfs-replicate-0:
> Completed entry selfheal on 6c79a368-edaa-432b-bef9-ec690ab42448. source=1
> sinks=0
> > [2015-02-03 20:55:49.343093] I
> [afr-self-heal-entry.c:554:afr_selfheal_entry_do] 0-homegfs-replicate-0:
> performing entry selfheal on 792cb0d6-9290-4447-8cd7-2b2d7a116a69
> > [2015-02-03 20:55:50.463652] I
> [afr-self-heal-common.c:476:afr_log_selfheal] 0-homegfs-replicate-0:
> Completed entry selfheal on 792cb0d6-9290-4447-8cd7-2b2d7a116a69. source=1
> sinks=0
> > [2015-02-03 20:55:51.465289] I
> [afr-self-heal-metadata.c:54:__afr_selfheal_metadata_do]
> 0-homegfs-replicate-0: performing metadata selfheal on
> 403e661a-1c27-4e79-9867-c0572aba2b3c
> > [2015-02-03 20:55:51.466515] I
> [afr-self-heal-common.c:476:afr_log_selfheal] 0-homegfs-replicate-0:
> Completed metadata selfheal on 403e661a-1c27-4e79-9867-c0572aba2b3c.
> source=1 sinks=0
> > [2015-02-03 20:55:51.467098] I
> [afr-self-heal-entry.c:554:afr_selfheal_entry_do] 0-homegfs-replicate-0:
> performing entry selfheal on 403e661a-1c27-4e79-9867-c0572aba2b3c
> > [2015-02-03 20:55:55.257808] I
> [afr-self-heal-common.c:476:afr_

Re: [Gluster-devel] [Gluster-users] missing files

2015-02-05 Thread David F. Robinson

copy that.  Thanks for looking into the issue.

David


-- Original Message --
From: "Benjamin Turner" 
To: "David F. Robinson" 
Cc: "Ben Turner" ; "Pranith Kumar Karampuri" 
; "Xavier Hernandez" ; 
"gluster-us...@gluster.org" ; "Gluster Devel" 


Sent: 2/5/2015 9:05:43 PM
Subject: Re: [Gluster-users] [Gluster-devel] missing files

Correct!  I have seen (back in the day; it's been about three years since I
last saw it) having, say, 50+ volumes, each with a geo-rep session, take system
load to the point where pings couldn't be serviced within the ping timeout.  So
it is known to happen, but there has been a lot of work in the geo-rep space to
help here, some of which is discussed here:

https://medium.com/@msvbhat/distributed-geo-replication-in-glusterfs-ec95f4393c50

(think tar + ssh and other fixes).  Your symptoms remind me of that case of
50+ geo-replicated volumes, which is why I mentioned it from the start.  My
current shoot-from-the-hip theory is that while rsyncing all that data the
servers got too busy to service the pings, and that led to disconnects.  This
is common across all of the clustering / distributed software I have worked
on: if the system gets too busy to service heartbeat within the timeout, things
go crazy (think fork bomb on a single host).  Now this could be a case of me
projecting symptoms from an old issue onto what you are describing, but that's
where my head is at.  If I'm correct I should be able to reproduce it using a
similar workload.  I think that the multi-threaded epoll changes that _just_
landed in master will help resolve this, but they are so new I haven't been
able to test them.  I'll know more when I get a chance to test tomorrow.


-b

On Thu, Feb 5, 2015 at 6:04 PM, David F. Robinson 
 wrote:

Isn't rsync what geo-rep uses?

David  (Sent from mobile)

===
David F. Robinson, Ph.D.
President - Corvid Technologies
704.799.6944 x101 [office]
704.252.1310  [cell]
704.799.7974  [fax]
david.robin...@corvidtec.com
http://www.corvidtechnologies.com

> On Feb 5, 2015, at 5:41 PM, Ben Turner  wrote:
>
> - Original Message -
>> From: "Ben Turner" 
>> To: "David F. Robinson" 
>> Cc: "Pranith Kumar Karampuri" , "Xavier 
Hernandez" , "Benjamin Turner"
>> , gluster-us...@gluster.org, "Gluster Devel" 


>> Sent: Thursday, February 5, 2015 5:22:26 PM
>> Subject: Re: [Gluster-users] [Gluster-devel] missing files
>>
>> - Original Message -
>>> From: "David F. Robinson" 
>>> To: "Ben Turner" 
>>> Cc: "Pranith Kumar Karampuri" , "Xavier 
Hernandez"

>>> , "Benjamin Turner"
>>> , gluster-us...@gluster.org, "Gluster Devel"
>>> 
>>> Sent: Thursday, February 5, 2015 5:01:13 PM
>>> Subject: Re: [Gluster-users] [Gluster-devel] missing files
>>>
>>> I'll send you the emails I sent Pranith with the logs. What causes these
>>> disconnects?
>>
>> Thanks David!  Disconnects happen when there are interruptions in
>> communication between peers; normally there is a ping timeout that happens.
>> It could be anything from a flaky NW to the system being too busy to respond
>> to the pings.  My initial take is more towards the latter, as rsync is
>> absolutely the worst use case for gluster - IIRC it writes in 4kb blocks.  I
>> try to keep my writes at least 64KB, as in my testing that is the smallest
>> block size I can write with before perf starts to really drop off.  I'll try
>> something similar in the lab.
>
> Ok, I do think that the file being self-healed is the RCA for what you were
> seeing.  Let's look at one of the disconnects:

>
> data-brick02a-homegfs.log:[2015-02-03 20:54:02.772180] I 
[server.c:518:server_rpc_notify] 0-homegfs-server: disconnecting 
connection from 
gfs01b.corvidtec.com-4175-2015/02/02-16:44:31:179119-homegfs-client-2-0-1

>
> And in the glustershd.log from the gfs01b_glustershd.log file:
>
> [2015-02-03 20:55:48.001797] I 
[afr-self-heal-entry.c:554:afr_selfheal_entry_do] 
0-homegfs-replicate-0: performing entry selfheal on 
6c79a368-edaa-432b-bef9-ec690ab42448
> [2015-02-03 20:55:49.341996] I 
[afr-self-heal-common.c:476:afr_log_selfheal] 0-homegfs-replicate-0: 
Completed entry selfheal on 6c79a368-edaa-432b-bef9-ec690ab42448. 
source=1 sinks=0
> [2015-02-03 20:55:49.343093] I 
[afr-self-heal-entry.c:554:afr_selfheal_entry_do] 
0-homegfs-replicate-0: performing entry selfheal on 
792cb0d6-9290-4447-8cd7-2b2d7a116a69
> [2015-02-03 20:55:50.463652] I 
[afr-self-heal-common.c:476:afr_log_selfheal] 0-homegfs-replicate-0: 
Completed entry selfheal on 792cb0d6-9290-4447-8cd7-2b2d7a116a69. 
source=1 sinks=0
> [2015-02-03 20:55:51.465289] I 
[afr-self-heal-metadata.c:54:__afr_selfheal_metadata_do] 
0-homegfs-replicate-0: performing metadata selfheal on 
403e661a-1c27-4e79-9867-c0572aba2b3c
> [2015-02-03 20:55:51.466515] I 
[afr-self-heal-common.c:476:afr_log_selfheal] 0-homegfs-replicate-0: 
Completed metadata selfheal on 403e661a-1c27-4e79-9867-c0572aba2b3c. 
source=1 sinks=0
> [2015-02-03 20:55:51.467098] I 
[afr-self-heal-entry.c:554:afr_selfheal_

Re: [Gluster-devel] missing files

2015-02-05 Thread David F. Robinson
Not repeatable.  Once it shows up, it stays there.  I sent some other 
strange behavior I am seeing to Pranith earlier this evening.  Attached 
below...


David

Another issue I am having that might be related is that I cannot delete
some directories. It complains that the directories are not empty, but
when I list them out, there is nothing there.
However, if I know the name of a directory, I can cd into it and
see the files.


[root@gfs01a Phase_1_SOCOM14-003_adv_armor]# pwd
/homegfs/documentation/programs/OLD_PROGRAMS/SBIR_TOM/Phase_1_SOCOM14-003_adv_armor

[root@gfs01a Phase_1_SOCOM14-003_adv_armor]# ls -al
total 0
drwxrws--x 7 root root 449 Feb 4 18:12 .
drwxrwx--- 3 root root 200 Feb 4 18:19 ..
drwxrws--- 3 root root 41 Feb 4 18:12 References
drwxrws--x 4 root root 54 Feb 4 18:12 Testing
drwxrws--- 4 root root 51 Feb 4 18:12 Velodyne
drwxrws--x 4 root root 38 Feb 4 18:12 progress_reports

[root@gfs01a Phase_1_SOCOM14-003_adv_armor]# rm -rf *
rm: cannot remove `References': Directory not empty
rm: cannot remove `Testing': Directory not empty
rm: cannot remove `Velodyne': Directory not empty
rm: cannot remove `progress_reports/pr2': Directory not empty
rm: cannot remove `progress_reports/pr3': Directory not empty

[root@gfs01a Phase_1_SOCOM14-003_adv_armor]# ls -alR
total 0
drwxrws--x 6 root root 449 Feb 4 18:12 .
drwxrwx--- 3 root root 200 Feb 4 18:19 ..
drwxrws--- 3 root root 41 Feb 4 18:12 References *** Note that there is 
nothing in this References directory.

drwxrws--x 4 root root 54 Feb 4 18:12 Testing
drwxrws--- 4 root root 51 Feb 4 18:12 Velodyne
drwxrws--x 4 root root 38 Feb 4 18:12 progress_reports


However, from the bricks (see listings below), there are other 
directories that are not shown. For example, the References directory 
contains the USSOCOM_OPAQUE_ARMOR directory on the brick, but it doesn't 
show up on the volume.


[root@gfs01a USSOCOM_OPAQUE_ARMOR]# pwd
/homegfs/documentation/programs/OLD_PROGRAMS/SBIR_TOM/Phase_1_SOCOM14-003_adv_armor

[root@gfs01a Phase_1_SOCOM14-003_adv_armor]# cd References/
[root@gfs01a References]# ls -al *** There is nothing shown in the 
References directory

total 0
drwxrws--- 3 root root 133 Feb 4 18:12 .
drwxrws--x 7 root root 449 Feb 4 18:12 ..

[root@gfs01a References]# cd USSOCOM_OPAQUE_ARMOR *** From the brick 
listing, I knew the directory name. Even though it isn't shown, I can cd 
to it and see the files.

[root@gfs01a USSOCOM_OPAQUE_ARMOR]# ls -al
total 6787
drwxrws--- 2 streadway sbir 244 Feb 5 21:28 .
drwxrws--- 3 root root 164 Feb 5 21:28 ..
-rwxrw 1 streadway sbir 42440 Jun 19 2014 ARMOR PACKAGES.one
-rwxrw 1 streadway sbir 17248 Jun 19 2014 COMPARISON OF 
SOLUTIONS.one
-rwxrw 1 streadway sbir 38184 Jun 19 2014 CURRENT STANDARD 
ARMORING.one

-rwxrw 1 sgilbert sbir 2974120 Jan 22 09:15 FEASABILITY STUDY.docx
-rwxrw 1 streadway sbir 3826704 Jan 21 14:57 FEASABILITY STUDY.one
-rwxrw 1 streadway sbir 49736 Jan 21 13:18 GIVEN TRADE SPACE.one



The recursive file listing (ls -alR) from each of the bricks shows that
there are files/directories that do not show up on the /homegfs volume.


[root@gfs01a Phase_1_SOCOM14-003_adv_armor]# ls -alR 
/data/brick0*/homegfs/documentation/programs/OLD_PROGRAMS/SBIR_TOM/Phase_1_SOCOM14-003_adv_armor/References

/data/brick01a/homegfs/documentation/programs/OLD_PROGRAMS/SBIR_TOM/Phase_1_SOCOM14-003_adv_armor/References:
total 0
drwxrws--- 3 root root 41 Feb 4 18:12 .
drwxrws--x 7 root root 118 Feb 4 18:12 ..
drwxrws--- 2 streadway sbir 75 Jan 23 14:46 USSOCOM_OPAQUE_ARMOR

/data/brick01a/homegfs/documentation/programs/OLD_PROGRAMS/SBIR_TOM/Phase_1_SOCOM14-003_adv_armor/References/USSOCOM_OPAQUE_ARMOR:
total 6648
drwxrws--- 2 streadway sbir 75 Jan 23 14:46 .
drwxrws--- 3 root root 41 Feb 4 18:12 ..
-rwxrw 2 sgilbert sbir 2974120 Jan 22 09:15 FEASABILITY STUDY.docx
-rwxrw 2 streadway sbir 3826704 Jan 21 14:57 FEASABILITY STUDY.one

/data/brick02a/homegfs/documentation/programs/OLD_PROGRAMS/SBIR_TOM/Phase_1_SOCOM14-003_adv_armor/References:
total 0
drwxrws--- 2 root root 10 Feb 4 18:12 .
drwxrws--x 6 root root 95 Feb 4 18:12 ..

[root@gfs01b ~]# ls -alR 
/data/brick0*/homegfs/documentation/programs/OLD_PROGRAMS/SBIR_TOM/Phase_1_SOCOM14-003_adv_armor/References

/data/brick01b/homegfs/documentation/programs/OLD_PROGRAMS/SBIR_TOM/Phase_1_SOCOM14-003_adv_armor/References:
total 0
drwxrws--- 3 root root 41 Feb 4 18:12 .
drwxrws--x 7 root root 118 Feb 4 18:12 ..
drwxrws--- 2 streadway sbir 75 Jan 23 14:46 USSOCOM_OPAQUE_ARMOR

/data/brick01b/homegfs/documentation/programs/OLD_PROGRAMS/SBIR_TOM/Phase_1_SOCOM14-003_adv_armor/References/USSOCOM_OPAQUE_ARMOR:
total 6648
drwxrws--- 2 streadway sbir 75 Jan 23 14:46 .
drwxrws--- 3 root root 41 Feb 4 18:12 ..
-rwxrw 2 sgilbert sbir 2974120 Jan 22 09:15 FEASABILITY STUDY.docx
-rwxrw 2 streadway sbir 3826704 Jan 21 14:57 FEASABILITY STUDY.one

/data/brick02b/homegfs/documentation/programs/OLD_PROGRAMS/SBIR_TOM/Phase_1_SOCOM1

Re: [Gluster-devel] missing files

2015-02-05 Thread Pranith Kumar Karampuri
This used to happen because of a DHT issue. Adding +Raghavendra to check if he
knows something about this.
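
A quick way to see what DHT has on disk for one of the problem directories is to
compare the gfid and layout xattrs across the bricks (only a sketch; run it on
the brick servers, with the path taken from the listings below):

    # Dump the extended attributes of the directory on each brick
    for b in /data/brick01a /data/brick02a; do
        getfattr -d -m . -e hex \
            "$b/homegfs/documentation/programs/OLD_PROGRAMS/SBIR_TOM/Phase_1_SOCOM14-003_adv_armor/References"
    done
    # A missing trusted.glusterfs.dht xattr, or gfids that differ between
    # bricks, would point at the kind of DHT problem mentioned above.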


Pranith
On 02/06/2015 10:06 AM, David F. Robinson wrote:
Not repeatable.  Once it shows up, it stays there.  I sent some other 
strange behavior I am seeing to Pranith earlier this evening.  
Attached below...


David

Another issue I am having that might be related is that I cannot
delete some directories. It complains that the directories are not
empty, but when I list them out, there is nothing there.
However, if I know the name of a directory, I can cd into it and
see the files.


[root@gfs01a Phase_1_SOCOM14-003_adv_armor]# pwd
/homegfs/documentation/programs/OLD_PROGRAMS/SBIR_TOM/Phase_1_SOCOM14-003_adv_armor 



[root@gfs01a Phase_1_SOCOM14-003_adv_armor]# ls -al
total 0
drwxrws--x 7 root root 449 Feb 4 18:12 .
drwxrwx--- 3 root root 200 Feb 4 18:19 ..
drwxrws--- 3 root root 41 Feb 4 18:12 References
drwxrws--x 4 root root 54 Feb 4 18:12 Testing
drwxrws--- 4 root root 51 Feb 4 18:12 Velodyne
drwxrws--x 4 root root 38 Feb 4 18:12 progress_reports

[root@gfs01a Phase_1_SOCOM14-003_adv_armor]# rm -rf *
rm: cannot remove `References': Directory not empty
rm: cannot remove `Testing': Directory not empty
rm: cannot remove `Velodyne': Directory not empty
rm: cannot remove `progress_reports/pr2': Directory not empty
rm: cannot remove `progress_reports/pr3': Directory not empty

[root@gfs01a Phase_1_SOCOM14-003_adv_armor]# ls -alR
total 0
drwxrws--x 6 root root 449 Feb 4 18:12 .
drwxrwx--- 3 root root 200 Feb 4 18:19 ..
drwxrws--- 3 root root 41 Feb 4 18:12 References *** Note that there 
is nothing in this References directory.

drwxrws--x 4 root root 54 Feb 4 18:12 Testing
drwxrws--- 4 root root 51 Feb 4 18:12 Velodyne
drwxrws--x 4 root root 38 Feb 4 18:12 progress_reports


However, from the bricks (see listings below), there are other 
directories that are not shown. For example, the References directory 
contains the USSOCOM_OPAQUE_ARMOR directory on the brick, but it 
doesn't show up on the volume.


[root@gfs01a USSOCOM_OPAQUE_ARMOR]# pwd
/homegfs/documentation/programs/OLD_PROGRAMS/SBIR_TOM/Phase_1_SOCOM14-003_adv_armor 



[root@gfs01a Phase_1_SOCOM14-003_adv_armor]# cd References/
[root@gfs01a References]# ls -al *** There is nothing shown in the 
References directory

total 0
drwxrws--- 3 root root 133 Feb 4 18:12 .
drwxrws--x 7 root root 449 Feb 4 18:12 ..

[root@gfs01a References]# cd USSOCOM_OPAQUE_ARMOR *** From the brick 
listing, I knew the directory name. Even though it isn't shown, I can 
cd to it and see the files.

[root@gfs01a USSOCOM_OPAQUE_ARMOR]# ls -al
total 6787
drwxrws--- 2 streadway sbir 244 Feb 5 21:28 .
drwxrws--- 3 root root 164 Feb 5 21:28 ..
-rwxrw 1 streadway sbir 42440 Jun 19 2014 ARMOR PACKAGES.one
-rwxrw 1 streadway sbir 17248 Jun 19 2014 COMPARISON OF SOLUTIONS.one
-rwxrw 1 streadway sbir 38184 Jun 19 2014 CURRENT STANDARD 
ARMORING.one

-rwxrw 1 sgilbert sbir 2974120 Jan 22 09:15 FEASABILITY STUDY.docx
-rwxrw 1 streadway sbir 3826704 Jan 21 14:57 FEASABILITY STUDY.one
-rwxrw 1 streadway sbir 49736 Jan 21 13:18 GIVEN TRADE SPACE.one



The recursive file listing (ls -alR) from each of the bricks shows that
there are files/directories that do not show up on the /homegfs volume.


[root@gfs01a Phase_1_SOCOM14-003_adv_armor]# ls -alR 
/data/brick0*/homegfs/documentation/programs/OLD_PROGRAMS/SBIR_TOM/Phase_1_SOCOM14-003_adv_armor/References
/data/brick01a/homegfs/documentation/programs/OLD_PROGRAMS/SBIR_TOM/Phase_1_SOCOM14-003_adv_armor/References: 


total 0
drwxrws--- 3 root root 41 Feb 4 18:12 .
drwxrws--x 7 root root 118 Feb 4 18:12 ..
drwxrws--- 2 streadway sbir 75 Jan 23 14:46 USSOCOM_OPAQUE_ARMOR

/data/brick01a/homegfs/documentation/programs/OLD_PROGRAMS/SBIR_TOM/Phase_1_SOCOM14-003_adv_armor/References/USSOCOM_OPAQUE_ARMOR: 


total 6648
drwxrws--- 2 streadway sbir 75 Jan 23 14:46 .
drwxrws--- 3 root root 41 Feb 4 18:12 ..
-rwxrw 2 sgilbert sbir 2974120 Jan 22 09:15 FEASABILITY STUDY.docx
-rwxrw 2 streadway sbir 3826704 Jan 21 14:57 FEASABILITY STUDY.one

/data/brick02a/homegfs/documentation/programs/OLD_PROGRAMS/SBIR_TOM/Phase_1_SOCOM14-003_adv_armor/References: 


total 0
drwxrws--- 2 root root 10 Feb 4 18:12 .
drwxrws--x 6 root root 95 Feb 4 18:12 ..

[root@gfs01b ~]# ls -alR 
/data/brick0*/homegfs/documentation/programs/OLD_PROGRAMS/SBIR_TOM/Phase_1_SOCOM14-003_adv_armor/References
/data/brick01b/homegfs/documentation/programs/OLD_PROGRAMS/SBIR_TOM/Phase_1_SOCOM14-003_adv_armor/References: 


total 0
drwxrws--- 3 root root 41 Feb 4 18:12 .
drwxrws--x 7 root root 118 Feb 4 18:12 ..
drwxrws--- 2 streadway sbir 75 Jan 23 14:46 USSOCOM_OPAQUE_ARMOR

/data/brick01b/homegfs/documentation/programs/OLD_PROGRAMS/SBIR_TOM/Phase_1_SOCOM14-003_adv_armor/References/USSOCOM_OPAQUE_ARMOR: 


total 6648
drwxrws--- 2 streadway sbir 75 Jan 23 14:46 .
drwxrws--- 3 root root 41 Feb 4 18:12 ..
-rwxrw 2 sgilbert sbir 2974120 Jan 22 09:15

Re: [Gluster-devel] [Gluster-users] Appending time to snap name in USS

2015-02-05 Thread Mohammed Rafi K C
We decided to append a time-stamp to the snap name when creating a snapshot
by default. Users can override this with the flag "no-timestamp"; the snapshot
will then be created without a time-stamp appended. So the snapshot create
syntax would be: "snapshot create <snapname> <volname> [no-timestamp]
[description <description>] [force]".

Patch for the same can be found here http://review.gluster.org/#/c/9597/1.
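
For example, with that patch in place the two forms would look roughly like this
(volume and snapshot names are only illustrative, and the exact name format of
the timestamped snapshot is whatever the patch generates):

    # Default: the creation time is appended to the snapshot name
    gluster snapshot create snap1 homegfs
    # -> shows up as something like snap1_GMT-2015.02.05-10.00.00

    # With no-timestamp, the name is left exactly as given
    gluster snapshot create snap2 homegfs no-timestamp
    # -> shows up as snap2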

Regards
Rafi KC

On 01/09/2015 12:18 PM, Poornima Gurusiddaiah wrote:
> Yes, the creation time of the snap is appended to the snapname
> dynamically,
> i.e. snapview-server takes the snaplist from glusterd, and while
> populating the dentry for the ".snaps" it appends the time.
>
> Thanks,
> Poornima
>
> 
>
> *From: *"Anand Avati" 
> *To: *"Poornima Gurusiddaiah" , "Gluster
> Devel" , "gluster-users"
> 
> *Sent: *Friday, January 9, 2015 1:49:02 AM
> *Subject: *Re: [Gluster-devel] Appending time to snap name in USS
>
> It would be convenient if the time is appended to the snap name on
> the fly (when receiving list of snap names from glusterd?) so that
> the timezone application can be dynamic (which is what users would
> expect).
>
> Thanks
>
> On Thu Jan 08 2015 at 3:21:15 AM Poornima Gurusiddaiah
> mailto:pguru...@redhat.com>> wrote:
>
> Hi,
>
> Windows has a feature called shadow copy. This is widely used
> by all
> windows users to view the previous versions of a file.
> For shadow copy to work with glusterfs backend, the problem
> was that
> the clients expect snapshots to contain some format
> of time in their name.
>
> After evaluating the possible ways(asking the user to create
> snapshot with some format of time in it and have rename snapshot
> for existing snapshots) the following method seemed simpler.
>
> If the USS is enabled, then the creation time of the snapshot is
> appended to the snapname and is listed in the .snaps directory.
> The actual name of the snapshot is left unmodified. i.e. the 
> snapshot
> list/info/restore etc. commands work with the original snapname.
> The patch for the same can be found
> @http://review.gluster.org/#/c/9371/
>
> The impact is that, the users would see the snapnames to be
> different in the ".snaps" folder
> than what they have created. Also the current patch does not
> take care of the scenario where
> the snapname already has time in its name.
>
> Eg:
> Without this patch:
> drwxr-xr-x 4 root root 110 Dec 26 04:14 snap1
> drwxr-xr-x 4 root root 110 Dec 26 04:14 snap2
>
> With this patch
> drwxr-xr-x 4 root root 110 Dec 26 04:14
> snap1@GMT-2014.12.30-05.07.50
> drwxr-xr-x 4 root root 110 Dec 26 04:14
> snap2@GMT-2014.12.30-23.49.02
>
> Please let me know if you have any suggestions or concerns on
> the same.
>
> Thanks,
> Poornima
> ___
> Gluster-devel mailing list
> Gluster-devel@gluster.org 
> http://www.gluster.org/mailman/listinfo/gluster-devel
> 
>
>
>
>
> ___
> Gluster-users mailing list
> gluster-us...@gluster.org
> http://www.gluster.org/mailman/listinfo/gluster-users

___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel


Re: [Gluster-devel] [Gluster-users] Input/Output Error on Gluster NFS

2015-02-05 Thread Soumya Koduri



On 02/05/2015 11:32 PM, Peter Auyeung wrote:

Hi Soumya

root@glusterprod001:~# gluster volume info | grep nfs.acl
02/05/15 10:00:05 [ /root ]

Seems like we do not have ACL enabled.

nfs client is a RHEL4 standard NFS client


Oh, by default ACLs are enabled. The option seems to be shown in 'gluster volume
info' only if we explicitly set its value to ON/OFF.

Can you please verify that the filesystem on which your Gluster bricks have
been created is mounted with ACLs enabled?
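
For instance, something along these lines on one of the brick servers (a sketch
only; the brick path comes from the volume info earlier in the thread, replace
<brick-device> with the device actually backing /data/brick01a, and the tune2fs
check applies only if the bricks sit on ext3/ext4, since XFS always has ACLs
enabled):

    # Was the brick filesystem mounted with the 'acl' option, or does the
    # filesystem carry it as a default mount option?
    mount | grep brick01a
    tune2fs -l /dev/<brick-device> | grep "Default mount options"

    # A direct functional test on the same filesystem, outside the exported dir:
    touch /data/brick01a/.acltest
    setfacl -m u:nobody:r /data/brick01a/.acltest && getfacl /data/brick01a/.acltest
    rm -f /data/brick01a/.acltest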


Thanks,
Soumya



Thanks
-Peter

From: Soumya Koduri [skod...@redhat.com]
Sent: Wednesday, February 04, 2015 11:28 PM
To: Peter Auyeung; gluster-us...@gluster.org; gluster-devel@gluster.org
Subject: Re: [Gluster-devel] [Gluster-users] Input/Output Error on Gluster NFS

Hi Peter,

Have you disabled Gluster-NFS ACLs .

Please check the option value -
#gluster v info | grep nfs.acl
nfs.acl: ON

Also please provide the details of the nfs-client you are using.
Typically, nfs-clients seem to issue getxattr before doing
setxattr/removexattr operations, and get 'ENOTSUPP' back in case ACLs are
disabled. But from the strace, it looks like the client issued
'removexattr' of 'system.posix_acl_default', which returned EIO.

Anyway, 'removexattr' should also have returned EOPNOTSUPP instead of EIO.

Thanks,
Soumya

On 02/05/2015 02:31 AM, Peter Auyeung wrote:

I was trying to copy a directory of files to Gluster via NFS and getting
permission denied with Input/Output error

---> r...@bizratedbstandby.bo2.shopzilla.sea (0.00)# cp -pr db /mnt/
cp: setting permissions for 
`/mnt/db/full/pr_bizrate_standby_SMLS.F02-01-22-35.d': Input/output error
cp: setting permissions for 
`/mnt/db/full/pr_bizrate_standby_logging.F02-02-18-10.b': Input/output error
cp: setting permissions for `/mnt/db/full/pr_bizrate_SMLS.F02-01-22-35.d': 
Input/output error
cp: setting permissions for 
`/mnt/db/full/pr_bizrate_standby_master.F02-02-22-00': Input/output error
cp: setting permissions for `/mnt/db/full': Input/output error
cp: setting permissions for `/mnt/db': Input/output error

Checked the gluster nfs.log, the etc log, and the brick logs; they look clean.
The files end up copied over with the right permissions.

Straced the copy, and it seems it failed on removexattr:

removexattr("/mnt/db", "system.posix_acl_default"...) = -1 EIO (Input/output 
error)

http://pastie.org/9884810

Any Clue?

Thanks
Peter







___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel


___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel