Re: [Gluster-devel] missing files

2015-02-17 Thread David F. Robinson

Any updates on this issue?  Thanks in advance...

David


-- Original Message --
From: Shyam srang...@redhat.com
To: David F. Robinson david.robin...@corvidtec.com; Justin Clift 
jus...@gluster.org

Cc: Gluster Devel gluster-devel@gluster.org
Sent: 2/11/2015 10:02:09 PM
Subject: Re: [Gluster-devel] missing files


On 02/11/2015 08:28 AM, David F. Robinson wrote:
My base filesystem has 40-TB and the tar takes 19 minutes. I copied 
over 10-TB and it took the tar extraction from 1-minute to 7-minutes.


My suspicion is that it is related to number of files and not 
necessarily file size. Shyam is looking into reproducing this behavior 
on a redhat system.


I am able to reproduce the issue on a similar setup internally (at 
least at the surface it seems to be similar to what David is facing).


I will continue the investigation for the root cause.

Shyam




Re: [Gluster-devel] missing files

2015-02-12 Thread David F. Robinson

Shyam,

You asked me to stop/start the slow volume to see if it fixed the timing 
issue.  I stopped/started homegfs_backup (the production volume with 40+ 
TB) and it didn't make it faster.  I didn't stop/start the fast volume 
to see if it made it slower.  I just did  that and sent out an email.  I 
saw a similar result as Pranith.


However, I tried the test below and saw no issues.  So, I don't know 
why restarting the older test3brick volume slowed it down, but the test 
below shows no slowdown.



#... Create 2-new bricks
gluster volume create test4brick 
gfsib01bkp.corvidtec.com:/data/brick01bkp/test4brick 
gfsib01bkp.corvidtec.com:/data/brick02bkp/test4brick
gluster volume create test5brick 
gfsib01bkp.corvidtec.com:/data/brick01bkp/test5brick 
gfsib01bkp.corvidtec.com:/data/brick02bkp/test5brick

gluster volume start test4brick
gluster volume start test5brick

mount /test4brick
mount /test5brick

cp /root/boost_1_57_0.tar /test4brick
cp /root/boost_1_57_0.tar /test5brick

#... Stop/start test4brick to see if this causes a timing issue
umount /test4brick
gluster volume stop test4brick
gluster volume start test4brick
mount /test4brick


#... Run test on both new bricks
cd /test4brick
time tar -xPf boost_1_57_0.tar; time rm -rf boost_1_57_0

real    1m29.712s
user    0m0.415s
sys     0m2.772s

real    0m18.866s
user    0m0.087s
sys     0m0.556s

cd /test5brick
time tar -xPf boost_1_57_0.tar; time rm -rf boost_1_57_0

real 1m28.243s
user 0m0.366s
sys 0m2.502s

real 0m18.193s
user 0m0.075s
sys 0m0.543s

#... Repeat again after stop/start of test4brick
umount /test4brick
gluster volume stop test4brick
gluster volume start test4brick
mount /test4brick
cd /test4brick
time tar -xPf boost_1_57_0.tar; time rm -rf boost_1_57_0

real    1m25.277s
user    0m0.466s
sys     0m3.107s

real    0m16.575s
user    0m0.084s
sys     0m0.577s

-- Original Message --
From: Shyam srang...@redhat.com
To: Pranith Kumar Karampuri pkara...@redhat.com; Justin Clift 
jus...@gluster.org
Cc: Gluster Devel gluster-devel@gluster.org; David F. Robinson 
david.robin...@corvidtec.com

Sent: 2/12/2015 10:46:14 AM
Subject: Re: [Gluster-devel] missing files


On 02/12/2015 06:22 AM, Pranith Kumar Karampuri wrote:


On 02/12/2015 03:05 PM, Pranith Kumar Karampuri wrote:


On 02/12/2015 09:14 AM, Justin Clift wrote:

On 12 Feb 2015, at 03:02, Shyam srang...@redhat.com wrote:

On 02/11/2015 08:28 AM, David F. Robinson wrote:
Just to increase confidence, I performed one more test: I stopped the 
volumes and re-started them. Now, on both volumes, the numbers are almost 
the same:

[root@gqac031 gluster-mount]# time rm -rf boost_1_57_0 ; time tar xf
boost_1_57_0.tar.gz

real 1m15.074s
user 0m0.550s
sys 0m4.656s

real 2m46.866s
user 0m5.347s
sys 0m16.047s

[root@gqac031 gluster-mount]# cd /gluster-emptyvol/
[root@gqac031 gluster-emptyvol]# ls
boost_1_57_0.tar.gz
[root@gqac031 gluster-emptyvol]# time tar xf boost_1_57_0.tar.gz

real 2m31.467s
user 0m5.475s
sys 0m15.471s

gqas015.sbu.lab.eng.bos.redhat.com:testvol on /gluster-mount type
fuse.glusterfs (rw,default_permissions,allow_other,max_read=131072)
gqas015.sbu.lab.eng.bos.redhat.com:emotyvol on /gluster-emptyvol type
fuse.glusterfs (rw,default_permissions,allow_other,max_read=131072)


If I remember right, we performed a similar test on David's setup, but 
I believe there was no significant performance gain there. David could 
you clarify?


Just so we know where we are headed :)

Shyam




Re: [Gluster-devel] missing files

2015-02-12 Thread David F. Robinson
That is very interesting.  I tried this test and received a similar 
result.  Stopping/starting the volume causes a timing issue on the blank 
volume.  It seems like there is some parameter that gets set when you 
create a volume and reset when you stop/start it.  Or, something gets 
set during the stop/start operation that causes the problem.  Is there a 
way to list all parameters that are set for a volume?  gluster volume 
info only shows the ones that the user has changed from defaults.
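A minimal sketch of two ways to see more than 'gluster volume info' shows; 
'volume set help' lists the settable options with their defaults, and newer 
releases (3.7+) also have 'volume get ... all', which may not be available 
on 3.6.x:

#... List all settable options and their default values
gluster volume set help

#... On newer releases, dump every option with its current value for a volume
gluster volume get test3brick all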


[root@gfs01bkp ~]# gluster volume stop test3brick
Stopping volume will make its data inaccessible. Do you want to 
continue? (y/n) y

volume stop: test3brick: success
[root@gfs01bkp ~]# gluster volume start test3brick
volume start: test3brick: success
[root@gfs01bkp ~]# mount /test3brick
[root@gfs01bkp ~]# cd /test3brick/
[root@gfs01bkp test3brick]# date; time tar -xPf boost_1_57_0.tar ; time 
rm -rf boost_1_57_0

Thu Feb 12 10:42:43 EST 2015

real    3m46.002s
user    0m0.421s
sys     0m2.812s

real    0m15.406s
user    0m0.092s
sys     0m0.549s


-- Original Message --
From: Pranith Kumar Karampuri pkara...@redhat.com
To: Justin Clift jus...@gluster.org; Shyam srang...@redhat.com
Cc: Gluster Devel gluster-devel@gluster.org; David F. Robinson 
david.robin...@corvidtec.com

Sent: 2/12/2015 6:22:23 AM
Subject: Re: [Gluster-devel] missing files



On 02/12/2015 03:05 PM, Pranith Kumar Karampuri wrote:


On 02/12/2015 09:14 AM, Justin Clift wrote:

On 12 Feb 2015, at 03:02, Shyam srang...@redhat.com wrote:

On 02/11/2015 08:28 AM, David F. Robinson wrote:
My base filesystem has 40-TB and the tar takes 19 minutes. I copied 
over 10-TB and it took the tar extraction from 1-minute to 
7-minutes.


My suspicion is that it is related to number of files and not 
necessarily file size. Shyam is looking into reproducing this 
behavior on a redhat system.
I am able to reproduce the issue on a similar setup internally (at 
least at the surface it seems to be similar to what David is 
facing).


I will continue the investigation for the root cause.
Here is the initial analysis of my investigation: (Thanks for 
providing me with the setup shyam, keep the setup we may need it for 
further analysis)


On bad volume:
 %-latency Avg-latency Min-Latency Max-Latency No. of calls Fop
 - --- --- ---  
  0.00 0.00 us 0.00 us 0.00 us 937104 FORGET
  0.00 0.00 us 0.00 us 0.00 us 872478 RELEASE
  0.00 0.00 us 0.00 us 0.00 us 23668 RELEASEDIR
  0.00 41.86 us 23.00 us 86.00 us 92 STAT
  0.01 39.40 us 24.00 us 104.00 us 218 STATFS
  0.28 55.99 us 43.00 us 1152.00 us 4065 SETXATTR
  0.58 56.89 us 25.00 us 4505.00 us 8236 OPENDIR
  0.73 26.80 us 11.00 us 257.00 us 22238 FLUSH
  0.77 152.83 us 92.00 us 8819.00 us 4065 RMDIR
  2.57 62.00 us 21.00 us 409.00 us 33643 WRITE
  5.46 199.16 us 108.00 us 469938.00 us 22238 UNLINK
  6.70 69.83 us 43.00 us .00 us 77809 LOOKUP
  6.97 447.60 us 21.00 us 54875.00 us 12631 READDIRP
  7.73 79.42 us 33.00 us 1535.00 us 78909 SETATTR
 14.11 2815.00 us 176.00 us 2106305.00 us 4065 MKDIR
 54.09 1972.62 us 138.00 us 1520773.00 us 22238 CREATE

On good volume:
 %-latency Avg-latency Min-Latency Max-Latency No. of calls Fop
 - --- --- ---  
  0.00 0.00 us 0.00 us 0.00 us 58870 FORGET
  0.00 0.00 us 0.00 us 0.00 us 66016 RELEASE
  0.00 0.00 us 0.00 us 0.00 us 16480 RELEASEDIR
  0.00 61.50 us 58.00 us 65.00 us 2 OPEN
  0.01 39.56 us 16.00 us 112.00 us 71 STAT
  0.02 41.29 us 27.00 us 79.00 us 163 STATFS
  0.03 36.06 us 17.00 us 98.00 us 301 FSTAT
  0.79 62.38 us 39.00 us 269.00 us 4065 SETXATTR
  1.14 242.99 us 25.00 us 28636.00 us 1497 READ
  1.54 59.76 us 25.00 us 6325.00 us 8236 OPENDIR
  1.70 133.75 us 89.00 us 374.00 us 4065 RMDIR
  2.25 32.65 us 15.00 us 265.00 us 22006 FLUSH
  3.37 265.05 us 172.00 us 2349.00 us 4065 MKDIR
  7.14 68.34 us 21.00 us 21902.00 us 33357 WRITE
 11.00 159.68 us 107.00 us 2567.00 us 22003 UNLINK
 13.82 200.54 us 133.00 us 21762.00 us 22003 CREATE
 17.85 448.85 us 22.00 us 54046.00 us 12697 READDIRP
 18.37 76.12 us 45.00 us 294.00 us 77044 LOOKUP
 20.95 85.54 us 35.00 us 1404.00 us 78204 SETATTR

As we can see here, FORGET/RELEASE counts are far higher on the brick from 
the full volume compared to the brick from the empty volume. This seems to 
suggest that the inode table on the volume with lots of data is carrying 
too many passive inodes, which need to be displaced to create new ones. I 
need to check whether they come in the fop path. I will continue my 
investigation and let you know.
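For reference, per-fop latency tables like the ones above are what the 
io-stats profile interface reports; a minimal sketch of collecting them 
around the tar test (volume name taken from this thread):

#... Enable profiling, run the workload, then dump and stop the counters
gluster volume profile test3brick start
time tar -xPf boost_1_57_0.tar
gluster volume profile test3brick info
gluster volume profile test3brick stop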
Just to increase confidence, I performed one more test: I stopped the 
volumes and re-started them. Now, on both volumes, the numbers are almost 
the same:


[root@gqac031 gluster-mount]# time rm -rf boost_1_57_0 ; time tar xf 
boost_1_57_0.tar.gz


real 1m15.074s

Re: [Gluster-devel] missing files

2015-02-12 Thread David F. Robinson



-- Original Message --
From: Shyam srang...@redhat.com
To: David F. Robinson david.robin...@corvidtec.com; Pranith Kumar 
Karampuri pkara...@redhat.com; Justin Clift jus...@gluster.org

Cc: Gluster Devel gluster-devel@gluster.org
Sent: 2/12/2015 11:26:51 AM
Subject: Re: [Gluster-devel] missing files


On 02/12/2015 11:18 AM, David F. Robinson wrote:

Shyam,

You asked me to stop/start the slow volume to see if it fixed the 
timing
issue. I stopped/started homegfs_backup (the production volume with 
40+

TB) and it didn't make it faster. I didn't stop/start the fast volume
to see if it made it slower. I just did that and sent out an email. I
saw a similar result as Pranith.


Just to be clear even after restart of the slow volume, we see ~19 
minutes for the tar to complete, correct?

Correct



Versus, on the fast volume it is anywhere between 00:55 - 3:00 minutes, 
irrespective of start, fresh create, etc. correct?

Correct



Shyam




Re: [Gluster-devel] missing files

2015-02-12 Thread David F. Robinson
FWIW, starting/stopping a volume that is fast doesn't consistently make 
it slow.  I just tried it again on an older volume... It doesn't make it 
slow.  I also went back and re-ran the test on test3brick and it isn't 
slow any longer.  Maybe there is a time lag after stopping/starting a 
volume before it becomes fast.


Either way, stopping/starting a fast volume only makes it slow for 
some period of time, and it doesn't consistently make it slow.  I don't 
think this is the issue; it's a red herring.


[root@gfs01bkp /]# gluster volume stop test2brick
Stopping volume will make its data inaccessible. Do you want to 
continue? (y/n) y

[root@gfs01bkp /]# gluster volume start test2brick
volume start: test2brick: success
[root@gfs01bkp /]# mount /test2brick
[root@gfs01bkp /]# cd /test2brick
[root@gfs01bkp test2brick]# time tar -xPf boost_1_57_0.tar; time rm -rf 
boost_1_57_0


real    1m1.124s
user    0m0.432s
sys     0m3.136s

real    0m16.630s
user    0m0.083s
sys     0m0.570s


#... Retest on test3brick after it has been up for 20-minutes following a 
volume restart... Compare this to running the test immediately after a 
restart, which gave a time of 3.5-minutes.
[root@gfs01bkp test3brick]#  time tar -xPf boost_1_57_0.tar; time rm -rf 
boost_1_57_0


real    1m17.786s
user    0m0.502s
sys     0m3.278s

real    0m18.103s
user    0m0.101s
sys     0m0.684s



-- Original Message --
From: Shyam srang...@redhat.com
To: David F. Robinson david.robin...@corvidtec.com; Pranith Kumar 
Karampuri pkara...@redhat.com; Justin Clift jus...@gluster.org

Cc: Gluster Devel gluster-devel@gluster.org
Sent: 2/12/2015 11:26:51 AM
Subject: Re: [Gluster-devel] missing files


On 02/12/2015 11:18 AM, David F. Robinson wrote:

Shyam,

You asked me to stop/start the slow volume to see if it fixed the 
timing
issue. I stopped/started homegfs_backup (the production volume with 
40+

TB) and it didn't make it faster. I didn't stop/start the fast volume
to see if it made it slower. I just did that and sent out an email. I
saw a similar result as Pranith.


Just to be clear even after restart of the slow volume, we see ~19 
minutes for the tar to complete, correct?


Versus, on the fast volume it is anywhere between 00:55 - 3:00 minutes, 
irrespective of start, fresh create, etc. correct?


Shyam




Re: [Gluster-devel] Fw: Re[2]: missing files

2015-02-11 Thread David F. Robinson
I will forward the emails to Shyam to the devel list. 


David  (Sent from mobile)

===
David F. Robinson, Ph.D. 
President - Corvid Technologies
704.799.6944 x101 [office]
704.252.1310  [cell]
704.799.7974  [fax]
david.robin...@corvidtec.com
http://www.corvidtechnologies.com

 On Feb 11, 2015, at 8:21 AM, Pranith Kumar Karampuri pkara...@redhat.com 
 wrote:
 
 
 On 02/11/2015 06:49 PM, Pranith Kumar Karampuri wrote:
 
 On 02/11/2015 08:36 AM, Shyam wrote:
 Did some analysis with David today on this here is a gist for the list,
 
 1) Volumes classified as slow (i.e with a lot of pre-existing data) and 
 fast (new volumes carved from the same backend file system that slow bricks 
 are on, with little or no data)
 
 2) We ran an strace of tar and also collected io-stats outputs from these 
 volumes, both show that create and mkdir is slower on slow as compared to 
 the fast volume. This seems to be the overall reason for slowness.
 Did you happen to do strace of the brick when this happened? If not, David, 
 can we get that information as well?
 It would be nice to compare the difference in syscalls of the bricks of the two 
 volumes to see if there are any extra syscalls adding to the delay.
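A sketch of how such a brick-side strace could be captured, assuming the 
brick PID is read from 'gluster volume status' (the PID and output path here 
are placeholders):

#... Find the brick process, then attach strace with timestamps and per-syscall timings
gluster volume status test3brick
strace -f -tt -T -o /tmp/brick.strace -p <brick PID>
#... Or just a summary table of syscall counts and times
strace -f -c -p <brick PID>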
 
 Pranith
 
 Pranith
 
 3) The tarball extraction is to a new directory on the gluster mount, so 
 all lookups etc. happen within this new name space on the volume
 
 4) Checked memory footprints of the slow bricks and fast bricks etc. 
 nothing untoward noticed there
 
 5) Restarted the slow volume, just as a test case to do things from 
 scratch, no improvement in performance.
 
 Currently attempting to reproduce this on a local system to see if the same 
 behavior is seen so that it becomes easier to debug etc.
 
 Others on the list can chime in as they see fit.
 
 Thanks,
 Shyam
 
 On 02/10/2015 09:58 AM, David F. Robinson wrote:
 Forwarding to devel list as recommended by Justin...
 
 David
 
 
 -- Forwarded Message --
 From: David F. Robinson david.robin...@corvidtec.com
 To: Justin Clift jus...@gluster.org
 Sent: 2/10/2015 9:49:09 AM
 Subject: Re[2]: [Gluster-devel] missing files
 
 Bad news... I don't think it is the old linkto files. Bad because, if
 that was the issue, cleaning up all of the bad linkto files would have
 fixed it. It seems like the system just gets slower as you add data.
 
 First, I set up a new clean volume (test2brick) on the same system as the
 old one (homegfs_bkp). See 'gluster v info' below. I ran my simple tar
 extraction test on the new volume and it took 58-seconds to complete
 (which, BTW, is 10-seconds faster than my old non-gluster system, so
 kudos). The time on homegfs_bkp is 19-minutes.
 
 Next, I copied 10-terabytes of data over to test2brick and re-ran the
 test which then took 7-minutes. I created a test3brick and ran the test
 and it took 53-seconds.
 
 To confirm all of this, I deleted all of the data from test2brick and
 re-ran the test. It took 51-seconds!!!
 
 BTW. I also checked the .glusterfs for stale linkto files (find . -type
 f -size 0 -perm 1000 -exec ls -al {} \;). There are many, many thousands
 of these types of files on the old volume and none on the new one, so I
 don't think this is related to the performance issue.
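To see where one of those zero-length sticky files points, a sketch using 
the xattr Shyam refers to elsewhere in these threads (the path is a 
placeholder):

#... The trusted.glusterfs.dht.linkto xattr names the subvolume that holds the real file
getfattr -d -m . -e text <path to linkto file on the brick>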
 
 Let me know how I should proceed. Send this to devel list? Pranith?
 others? Thanks...
 
 [root@gfs01bkp .glusterfs]# gluster volume info homegfs_bkp
 Volume Name: homegfs_bkp
 Type: Distribute
 Volume ID: 96de8872-d957-4205-bf5a-076e3f35b294
 Status: Started
 Number of Bricks: 2
 Transport-type: tcp
 Bricks:
 Brick1: gfsib01bkp.corvidtec.com:/data/brick01bkp/homegfs_bkp
 Brick2: gfsib01bkp.corvidtec.com:/data/brick02bkp/homegfs_bkp
 
 [root@gfs01bkp .glusterfs]# gluster volume info test2brick
 Volume Name: test2brick
 Type: Distribute
 Volume ID: 123259b2-3c61-4277-a7e8-27c7ec15e550
 Status: Started
 Number of Bricks: 2
 Transport-type: tcp
 Bricks:
 Brick1: gfsib01bkp.corvidtec.com:/data/brick01bkp/test2brick
 Brick2: gfsib01bkp.corvidtec.com:/data/brick02bkp/test2brick
 
 [root@gfs01bkp glusterfs]# gluster volume info test3brick
 Volume Name: test3brick
 Type: Distribute
 Volume ID: 9b1613fc-f7e5-4325-8f94-e3611a5c3701
 Status: Started
 Number of Bricks: 2
 Transport-type: tcp
 Bricks:
 Brick1: gfsib01bkp.corvidtec.com:/data/brick01bkp/test3brick
 Brick2: gfsib01bkp.corvidtec.com:/data/brick02bkp/test3brick
 
 
 From homegfs_bkp:
 # find . -type f -size 0 -perm 1000 -exec ls -al {} \;
 T 2 gmathur pme_ics 0 Jan 9 16:59
 ./00/16/00169a69-1a7a-44c9-b2d8-991671ee87c4
 -T 3 jcowan users 0 Jan 9 17:51
 ./00/16/0016a0a0-fd22-4fb5-b6fb-5d7f9024ab74
 -T 2 morourke sbir 0 Jan 9 18:17
 ./00/16/0016b36f-32fc-4f2c-accd-e36be2f6c602
 -T 2 carpentr irl 0 Jan 9 18:52
 ./00/16/00163faf-741c-4e40-8081-784786b3cc71
 -T 3 601 raven 0 Jan 9 22:49
 ./00/16/00163385-a332-4050-8104-1b1af6cd8249
 -T 3 bangell sbir 0 Jan 9 22:56
 ./00/16/00167803-0244-46de-8246

Re: [Gluster-devel] missing files

2015-02-11 Thread David F. Robinson
My base filesystem has 40-TB and the tar takes 19 minutes. I copied over 10-TB 
and it took the tar extraction from 1-minute to 7-minutes. 

My suspicion is that it is related to number of files and not necessarily file 
size. Shyam is looking into reproducing this behavior on a redhat system. 

David  (Sent from mobile)

===
David F. Robinson, Ph.D. 
President - Corvid Technologies
704.799.6944 x101 [office]
704.252.1310  [cell]
704.799.7974  [fax]
david.robin...@corvidtec.com
http://www.corvidtechnologies.com

 On Feb 11, 2015, at 7:38 AM, Justin Clift jus...@gluster.org wrote:
 
 On 11 Feb 2015, at 12:31, David F. Robinson david.robin...@corvidtec.com 
 wrote:
 
 Some time ago I had a similar performance problem (with 3.4 if I remember 
 correctly): a just created volume started to work fine, but after some time 
 using it performance was worse. Removing all files from the volume didn't 
 improve the performance again.
 
 I guess my problem is a little better depending on how you look at it. If I 
 delete the data from the volume, the performance goes back to that of an empty 
 volume. I don't have to delete the .glusterfs entries to regain my 
 performance. I only have to delete the data from the mount point.
 
 Interesting.  Do you have somewhat accurate stats on how much data (eg # of 
 entries, size
 of files) was in the data set that did this?
 
 Wondering if it's repeatable, so we can replicate the problem and solve. :)
 
 + Justin
 
 --
 GlusterFS - http://www.gluster.org
 
 An open source, distributed file system scaling to several
 petabytes, and handling thousands of clients.
 
 My personal twitter: twitter.com/realjustinclift
 


Re: [Gluster-devel] missing files

2015-02-11 Thread David F. Robinson
Don't think it is the underlying file system. /data/brickxx is the underlying 
xfs. Performance to this is fine. When I created a volume it just puts the data 
in /data/brick/test2. The underlying filesystem shouldn't know/care that it is 
in a new directory. 

Also, if I create a /data/brick/test2 volume and put data on it, it gets slow 
in gluster. But, writing to /data/brick is still fine. And, after test2 gets 
slow, I can create a /data/test3 volume that is empty and its speed is fine. 

My knowledge is admittedly very limited here, but I don't see how it could be 
the underlying filesystem if the slowdown only occurs on the gluster mount and 
not on the underlying xfs filesystem. 

David  (Sent from mobile)

===
David F. Robinson, Ph.D. 
President - Corvid Technologies
704.799.6944 x101 [office]
704.252.1310  [cell]
704.799.7974  [fax]
david.robin...@corvidtec.com
http://www.corvidtechnologies.com

 On Feb 11, 2015, at 12:18 AM, Justin Clift jus...@gluster.org wrote:
 
 On 11 Feb 2015, at 03:06, Shyam srang...@redhat.com wrote:
 snip
 2) We ran an strace of tar and also collected io-stats outputs from these 
 volumes, both show that create and mkdir is slower on slow as compared to 
 the fast volume. This seems to be the overall reason for slowness
 
 Any idea's on why the create and mkdir is slower?
 
 Wondering if it's a case of underlying filesystem parameters (for the bricks)
 + maybe physical storage structure having become badly optimised over time.
 eg if its on spinning rust, not ssd, and sector placement is now bad
 
 Any idea if there are tools that can analyse this kind of thing?  eg meta
 data placement / fragmentation / on a drive for XFS/ext4
 
 + Justin
 
 --
 GlusterFS - http://www.gluster.org
 
 An open source, distributed file system scaling to several
 petabytes, and handling thousands of clients.
 
 My personal twitter: twitter.com/realjustinclift
 


[Gluster-devel] stale file handle

2015-02-10 Thread David F. Robinson
I am seeing the following on one of my FUSE clients (indy.rst and 
indy.rst.old show up as ??? ??? in the listing).
Has anyone seen this before?  Any idea what causes this for a given 
client?

If I try to access the file, I get a stale file handle:

# cp indy.rst dfr.rst
cp: cannot stat `indy.rst': Stale file handle
Bottom of the log has:
[2015-02-10 22:20:34.913632] I [dht-rename.c:1344:dht_rename] 
0-homegfs-dht: renaming 
/hpc_shared/motorsports/gmics/Raven/p4/133/dc3.tmp 
(hash=homegfs-replicate-2/cache=homegfs-replicate-2) = 
/hpc_shared/motorsports/gmics/Raven/p4/133/data_collected3 
(hash=homegfs-replicate-3/cache=homegfs-replicate-2)
[2015-02-10 22:40:04.138594] W 
[client-rpc-fops.c:504:client3_3_stat_cbk] 0-homegfs-client-1: remote 
operation failed: Stale file handle
[2015-02-10 22:40:04.158855] W [MSGID: 108008] 
[afr-read-txn.c:221:afr_read_txn] 0-homegfs-replicate-0: Unreadable 
subvolume -1 found with event generation 2. (Possible split-brain)
[2015-02-10 22:40:04.202696] W [fuse-bridge.c:779:fuse_attr_cbk] 
0-glusterfs-fuse: 1396664: STAT() 
/hpc_shared/motorsports/gmics/Raven/p3/70_sst_r4_1em3/indy.rst.old = -1 
(Stale file handle)
The message W [MSGID: 108008] [afr-read-txn.c:221:afr_read_txn] 
0-homegfs-replicate-0: Unreadable subvolume -1 found with event 
generation 2. (Possible split-brain) repeated 14 times between 
[2015-02-10 22:40:04.158855] and [2015-02-10 22:41:00.610296]
[2015-02-10 22:41:45.339419] W [MSGID: 108008] 
[afr-read-txn.c:221:afr_read_txn] 0-homegfs-replicate-0: Unreadable 
subvolume -1 found with event generation 2. (Possible split-brain)
The message W [MSGID: 108008] [afr-read-txn.c:221:afr_read_txn] 
0-homegfs-replicate-0: Unreadable subvolume -1 found with event 
generation 2. (Possible split-brain) repeated 31 times between 
[2015-02-10 22:41:45.339419] and [2015-02-10 22:43:11.483421]
[2015-02-10 22:43:37.498720] W [MSGID: 108008] 
[afr-read-txn.c:221:afr_read_txn] 0-homegfs-replicate-0: Unreadable 
subvolume -1 found with event generation 2. (Possible split-brain)
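Given the 'Possible split-brain' warnings above, a quick check from one of 
the servers could be, as a sketch:

#... List entries the self-heal daemon still considers pending, and any in split-brain
gluster volume heal homegfs info
gluster volume heal homegfs info split-brain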






However, the same files on my other FUSE clients look fine:


From the storage system:



From another client:


Re: [Gluster-devel] cannot delete non-empty directory

2015-02-09 Thread David F. Robinson
So, just to be sure before I do this: is it okay to do the following if 
I want to get rid of everything in the /old_shelf4/Aegis directory and 
below?


 rm -rf /data/brick*/homegfs_bkp/backup.0/old_shelf4/Aegis

What happens to all of the files in the .glusterfs directory?  Does this 
get rebuilt, or do the links stay there for files that no longer exist?


And is this same issue what causes all of the broken links in 
.glusterfs?  See the attached image for an example.  There appear to be a 
lot of broken links in the .glusterfs directories.  Is this normal, or 
does it indicate another problem?


Finally, if I search through the /data/brick* directories, should there 
be no zero-length files with ---T permissions?  Do I need to clean all 
of these up somehow?  A quick look at 
/data/brick01bkp/homegfs_bkp/.glusterfs/2f/54 shows many of these files. 
They look like
-T   3 rbhinge  pme_ics   0 Jan  9 16:45 
2f54d7d6-968b-442f-8cfe-eff01d6cefe7
-T   2 rbhinge  pme_ics   0 Jan  9 21:40 
2f54d7e7-b198-4fd4-aec7-f5d0ff020f72


How do I find out what file these entries were pointing to?
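One way to answer that, as a sketch: for regular files the 
.glusterfs/xx/yy/<gfid> entry is just another hard link to the file inside 
the brick, so its named path (if one still exists) can be found by matching 
inodes:

#... Find the other hard link(s) of a .glusterfs entry on the same brick
find /data/brick01bkp/homegfs_bkp -samefile \
    /data/brick01bkp/homegfs_bkp/.glusterfs/2f/54/2f54d7d6-968b-442f-8cfe-eff01d6cefe7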

David




-- Original Message --
From: Shyam srang...@redhat.com
To: David F. Robinson david.robin...@corvidtec.com; Gluster Devel 
gluster-devel@gluster.org; gluster-us...@gluster.org 
gluster-us...@gluster.org; Susant Palai spa...@redhat.com

Sent: 2/9/2015 11:11:20 AM
Subject: Re: [Gluster-devel] cannot delete non-empty directory


On 02/08/2015 12:19 PM, David F. Robinson wrote:

I am seeing these messages after I delete large amounts of data using
gluster 3.6.2.
cannot delete non-empty directory:
old_shelf4/Aegis/!!!Programs/RavenCFD/Storage/Jimmy_Old/src_vj1.5_final
*_From the FUSE mount (as root), the directory shows up as empty:_*
# pwd
/backup/homegfs/backup.0/old_shelf4/Aegis/!!!Programs/RavenCFD/Storage/Jimmy_Old/src_vj1.5_final

# ls -al
total 5
d- 2 root root 4106 Feb 6 13:55 .
drwxrws--- 3 601 dmiller 72 Feb 6 13:55 ..
However, when you look at the bricks, the files are still there (none 
on
brick01bkp, all files are on brick02bkp). All of the files are 
0-length

and have --T permissions.


These files are linkto files created by DHT, which basically means the 
files were either renamed or the brick layout changed (I suspect the 
former is the cause).


These files should have been deleted when the files that they point to 
were deleted; it looks like this did not happen.


Can I get the following information for some of the files here?
- getfattr -d -m . -e text <path to file on brick>
  - The output of the trusted.glusterfs.dht.linkto xattr should state 
where the real file belongs; in this case, as there are only 2 bricks, 
it should be the brick01bkp subvol.
- As the second brick is empty, we should be able to safely delete 
these files from the brick and then do an rmdir on the mount point of 
the volume, as the directory is now empty.
- Please also check the one sub-directory that is showing up in this 
case, save1.


Any suggestions on how to fix this and how to prevent it from 
happening?


I believe there are renames happening here, possibly by the archive 
creator. One way to prevent the rename from creating a linkto file is 
to use the DHT set parameter to define a pattern so that the file-name 
hash considers only the static part of the name.

The set parameter is cluster.extra-hash-regex.

A link on a similar problem and how to use this set parameter (there are 
a few in the gluster forums) would be 
http://www.gluster.org/pipermail/gluster-devel/2014-November/042863.html
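A hypothetical example of setting it; as far as I understand, the regex is 
expected to contain a single capture group whose match is the part of the 
name used for hashing, and the pattern below is illustrative only and would 
have to be adapted to how the temporary names are actually formed:

#... Hypothetical pattern: hash on everything before a trailing ".<suffix>"
gluster volume set homegfs_bkp cluster.extra-hash-regex '^(.+)\.[^.]+$'
#... Return to the default behaviour
gluster volume reset homegfs_bkp cluster.extra-hash-regex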


Additionally, there is a bug here: the unlink of the file should have 
cleaned up the linkto as well, so that none of the above would be 
required. We have noticed this with NFS and FUSE mounts (ref bugs 
1117923, 1139992), and investigation is in progress on the same. We 
will step up the priority on this so that we have a clean fix that can 
be used to prevent this in the future.


Shyam


Re: [Gluster-devel] cannot delete non-empty directory

2015-02-09 Thread David F. Robinson
 
/backup/homegfs/backup.0/old_shelf4/Aegis/!!!Programs/Nextel_Cup/SHR/Backup/shr/Airbox/C24

[root@gfs01bkp C24]# ls -al
total 1
drwx-- 2 jcowan users 39 Feb  6 12:41 .
drwxrw-rw- 4 jcowan users 62 Feb  6 19:19 ..

[root@gfs01bkp C24]# ls -al 
/data/brick*/homegfs_bkp/backup.0/old_shelf4/Aegis/\!\!\!Programs/Nextel_Cup/SHR/Backup/shr/Airbox/C24/z_slices/

total 4
drwxrw-rw-+ 2 jcowan users 4096 Feb  6 12:41 .
drwxrw-rw-+ 3 jcowan users   29 Feb  6 12:41 ..
-T  5 jcowan users0 Nov 19 23:30 
c24-airbox_vr_z=25_zoom.jpeg

-T  5 jcowan users0 Nov 19 23:30 c24-airbox_vr_z=26.jpeg
-T  5 jcowan users0 Nov 19 23:30 c24-airbox_vr_z=27.jpeg
-T  5 jcowan users0 Nov 19 23:30 c24-airbox_vr_z=28.jpeg
-T  5 jcowan users0 Nov 19 23:30 
c24-airbox_vr_z=29.5_zoom.jpeg

-T  5 jcowan users0 Nov 19 23:30 c24-airbox_vr_z=30.jpeg
-T  5 jcowan users0 Nov 19 23:30 c24-airbox_vr_z=31.jpeg
-T  5 jcowan users0 Nov 19 23:30 c24-airbox_vr_z=32.5.jpeg
[root@gfs01bkp C24]# getfattr -d -m . -e text 
/data/brick*bkp/homegfs_bkp/backup.0/old_shelf4/Aegis/\!\!\!Programs/Nextel_Cup/SHR/Backup/shr/Airbox/C24/z_slices/*

getfattr: Removing leading '/' from absolute path names
# file: 
data/brick01bkp/homegfs_bkp/backup.0/old_shelf4/Aegis/!!!Programs/Nextel_Cup/SHR/Backup/shr/Airbox/C24/z_slices/c24-airbox_vr_z=25_zoom.jpeg

trusted.gfid=îr'V*N©ÍÆF¿
trusted.glusterfs.dht.linkto=homegfs_bkp-client-1

# file: 
data/brick01bkp/homegfs_bkp/backup.0/old_shelf4/Aegis/!!!Programs/Nextel_Cup/SHR/Backup/shr/Airbox/C24/z_slices/c24-airbox_vr_z=26.jpeg

trusted.gfid=Là¾}®ÀLdza¥U
trusted.glusterfs.dht.linkto=homegfs_bkp-client-1

# file: 
data/brick01bkp/homegfs_bkp/backup.0/old_shelf4/Aegis/!!!Programs/Nextel_Cup/SHR/Backup/shr/Airbox/C24/z_slices/c24-airbox_vr_z=27.jpeg

trusted.gfid=©.ñªû2@¬ºÜdíÁ?%_
trusted.glusterfs.dht.linkto=homegfs_bkp-client-1

# file: 
data/brick01bkp/homegfs_bkp/backup.0/old_shelf4/Aegis/!!!Programs/Nextel_Cup/SHR/Backup/shr/Airbox/C24/z_slices/c24-airbox_vr_z=28.jpeg

trusted.gfid=0¥
/DªÒx?Ïý
trusted.glusterfs.dht.linkto=homegfs_bkp-client-1

# file: 
data/brick01bkp/homegfs_bkp/backup.0/old_shelf4/Aegis/!!!Programs/Nextel_Cup/SHR/Backup/shr/Airbox/C24/z_slices/c24-airbox_vr_z=29.5_zoom.jpeg

trusted.gfid=¼9T'$²Cí¯Eÿx!1
trusted.glusterfs.dht.linkto=homegfs_bkp-client-1

# file: 
data/brick01bkp/homegfs_bkp/backup.0/old_shelf4/Aegis/!!!Programs/Nextel_Cup/SHR/Backup/shr/Airbox/C24/z_slices/c24-airbox_vr_z=30.jpeg

trusted.gfid=tè
8rð
trusted.glusterfs.dht.linkto=homegfs_bkp-client-1

# file: 
data/brick01bkp/homegfs_bkp/backup.0/old_shelf4/Aegis/!!!Programs/Nextel_Cup/SHR/Backup/shr/Airbox/C24/z_slices/c24-airbox_vr_z=31.jpeg

trusted.gfid=x´Å
 EŦ¡ZmØWà
trusted.glusterfs.dht.linkto=homegfs_bkp-client-1

# file: 
data/brick01bkp/homegfs_bkp/backup.0/old_shelf4/Aegis/!!!Programs/Nextel_Cup/SHR/Backup/shr/Airbox/C24/z_slices/c24-airbox_vr_z=32.5.jpeg

trusted.gfid=d+0ÇxþM¯GxÑ@Â
trusted.glusterfs.dht.linkto=homegfs_bkp-client-1




-- Original Message --
From: Shyam srang...@redhat.com
To: David F. Robinson david.robin...@corvidtec.com; Gluster Devel 
gluster-devel@gluster.org; gluster-us...@gluster.org 
gluster-us...@gluster.org; Susant Palai spa...@redhat.com

Sent: 2/9/2015 11:11:20 AM
Subject: Re: [Gluster-devel] cannot delete non-empty directory


On 02/08/2015 12:19 PM, David F. Robinson wrote:

I am seeing these messages after I delete large amounts of data using
gluster 3.6.2.
cannot delete non-empty directory:
old_shelf4/Aegis/!!!Programs/RavenCFD/Storage/Jimmy_Old/src_vj1.5_final
*_From the FUSE mount (as root), the directory shows up as empty:_*
# pwd
/backup/homegfs/backup.0/old_shelf4/Aegis/!!!Programs/RavenCFD/Storage/Jimmy_Old/src_vj1.5_final

# ls -al
total 5
d- 2 root root 4106 Feb 6 13:55 .
drwxrws--- 3 601 dmiller 72 Feb 6 13:55 ..
However, when you look at the bricks, the files are still there (none 
on
brick01bkp, all files are on brick02bkp). All of the files are 
0-length

and have --T permissions.


These files are linkto files that are created by DHT, which basically 
mean the files were either renamed, or the brick layout changed (I 
suspect the former to be the cause).


These files should have been deleted when the files that they point to 
were deleted, looks like this did not happen.


Can I get the following information for some of the files here,
- getfattr -d -m . -e text path to file on brick
  - The output of trusted.glusterfs.dht.linkto xattr should state where 
the real file belongs, in this case as there are only 2 bricks, it 
should be brick01bkp subvol
- As the second brick is empty, we should be able to safely delete 
these files from the brick and proceed to do an rmdir on the mount 
point of the volume as the directory is now empty.
- Please check, the one sub-directory that is showing up in this case 
as well, save1


Any suggestions

[Gluster-devel] cannot delete non-empty directory

2015-02-08 Thread David F. Robinson
I am seeing these messages after I delete large amounts of data using
gluster 3.6.2.
cannot delete non-empty directory: 
old_shelf4/Aegis/!!!Programs/RavenCFD/Storage/Jimmy_Old/src_vj1.5_final



From the FUSE mount (as root), the directory shows up as empty:


# pwd
/backup/homegfs/backup.0/old_shelf4/Aegis/!!!Programs/RavenCFD/Storage/Jimmy_Old/src_vj1.5_final

# ls -al
total 5
d- 2 root root4106 Feb  6 13:55 .
drwxrws--- 3  601 dmiller   72 Feb  6 13:55 ..

However, when you look at the bricks, the files are still there (none on 
brick01bkp, all files are on brick02bkp).  All of the files are 0-length 
and have --T permissions.

Any suggestions on how to fix this and how to prevent it from happening?

#  ls -al 
/data/brick*/homegfs_bkp/backup.0/old_shelf4/Aegis/\!\!\!Programs/RavenCFD/Storage/Jimmy_Old/src_vj1.5_final

/data/brick01bkp/homegfs_bkp/backup.0/old_shelf4/Aegis/!!!Programs/RavenCFD/Storage/Jimmy_Old/src_vj1.5_final:
total 4
d-+ 2 root root  10 Feb  6 13:55 .
drwxrws---+ 3  601 raven 36 Feb  6 13:55 ..

/data/brick02bkp/homegfs_bkp/backup.0/old_shelf4/Aegis/!!!Programs/RavenCFD/Storage/Jimmy_Old/src_vj1.5_final:
total 8
d-+ 3 root root  4096 Dec 31  1969 .
drwxrws---+ 3  601 raven   36 Feb  6 13:55 ..
-T  5  601 raven0 Nov 20 00:08 read_inset.f.gz
-T  5  601 raven0 Nov 20 00:08 readbc.f.gz
-T  5  601 raven0 Nov 20 00:08 readcn.f.gz
-T  5  601 raven0 Nov 20 00:08 readinp.f.gz
-T  5  601 raven0 Nov 20 00:08 readinp_v1_2.f.gz
-T  5  601 raven0 Nov 20 00:08 readinp_v1_3.f.gz
-T  5  601 raven0 Nov 20 00:08 rotatept.f.gz
d-+ 2 root root   118 Feb  6 13:54 save1
-T  5  601 raven0 Nov 20 00:08 sepvec.f.gz
-T  5  601 raven0 Nov 20 00:08 shadow.f.gz
-T  5  601 raven0 Nov 20 00:08 snksrc.f.gz
-T  5  601 raven0 Nov 20 00:08 source.f.gz
-T  5  601 raven0 Nov 20 00:08 step.f.gz
-T  5  601 raven0 Nov 20 00:08 stoprog.f.gz
-T  5  601 raven0 Nov 20 00:08 summer6.f.gz
-T  5  601 raven0 Nov 20 00:08 totforc.f.gz
-T  5  601 raven0 Nov 20 00:08 tritet.f.gz
-T  5  601 raven0 Nov 20 00:08 wallrsd.f.gz
-T  5  601 raven0 Nov 20 00:08 wheat.f.gz
-T  5  601 raven0 Nov 20 00:08 write_inset.f.gz


This is using gluster 3.6.2 on a distributed gluster volume that resides 
on a single machine.  Both of the bricks are on one machine consisting 
of 2x RAID-6 arrays.


df -h | grep brick
/dev/mapper/vg01-lvol1   88T   22T   66T  25% 
/data/brick01bkp
/dev/mapper/vg02-lvol1   88T   22T   66T  26% 
/data/brick02bkp


# gluster volume info homegfs_bkp
Volume Name: homegfs_bkp
Type: Distribute
Volume ID: 96de8872-d957-4205-bf5a-076e3f35b294
Status: Started
Number of Bricks: 2
Transport-type: tcp
Bricks:
Brick1: gfsib01bkp.corvidtec.com:/data/brick01bkp/homegfs_bkp
Brick2: gfsib01bkp.corvidtec.com:/data/brick02bkp/homegfs_bkp
Options Reconfigured:
storage.owner-gid: 100
performance.io-thread-count: 32
server.allow-insecure: on
network.ping-timeout: 10
performance.cache-size: 128MB
performance.write-behind-window-size: 128MB
server.manage-gids: on
changelog.rollover-time: 15
changelog.fsync-interval: 3



===
David F. Robinson, Ph.D.
President - Corvid Technologies
704.799.6944 x101 [office]
704.252.1310 [cell]
704.799.7974 [fax]
david.robin...@corvidtec.com
http://www.corvidtechnologies.com



Re: [Gluster-devel] [Gluster-users] missing files

2015-02-06 Thread David F. Robinson
I don't think I understood what you sent enough to give it a try.  I'll 
wait until it comes out in a beta or release version.


David


-- Original Message --
From: Ben Turner btur...@redhat.com
To: Justin Clift jus...@gluster.org; David F. Robinson 
david.robin...@corvidtec.com
Cc: Benjamin Turner bennytu...@gmail.com; gluster-us...@gluster.org; 
Gluster Devel gluster-devel@gluster.org

Sent: 2/6/2015 3:33:42 PM
Subject: Re: [Gluster-devel] [Gluster-users] missing files


- Original Message -

 From: Justin Clift jus...@gluster.org
 To: Benjamin Turner bennytu...@gmail.com
 Cc: David F. Robinson david.robin...@corvidtec.com, 
gluster-us...@gluster.org, Gluster Devel

 gluster-devel@gluster.org, Ben Turner btur...@redhat.com
 Sent: Friday, February 6, 2015 3:27:53 PM
 Subject: Re: [Gluster-devel] [Gluster-users] missing files

 On 6 Feb 2015, at 02:05, Benjamin Turner bennytu...@gmail.com 
wrote:
  I think that the multi threaded epoll changes that _just_ landed in 
master
  will help resolve this, but they are so new I haven't been able to 
test

  this. I'll know more when I get a chance to test tomorrow.

 Which multi-threaded epoll code just landed in master? Are you 
thinking

 of this one?

   http://review.gluster.org/#/c/3842/

 If so, it's not in master yet. ;)


Doh! I just saw - Required patches are all upstream now and assumed 
they were merged. I have been in class all week so I am not up2date 
with everything. I gave instructions on compiling it from the gerrit 
patches + master so if David wants to give it a go he can. Sorry for 
the confusion.
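For anyone wanting to try the same thing, a rough sketch of the usual Gerrit 
workflow; the clone URL, change ref and patchset number below are 
illustrative assumptions, not the exact ones Ben used:

#... Clone master and cherry-pick a Gerrit change (refs/changes/<last-2-digits>/<change>/<patchset>)
git clone https://github.com/gluster/glusterfs.git && cd glusterfs
git fetch http://review.gluster.org/glusterfs refs/changes/42/3842/1 && git cherry-pick FETCH_HEAD
#... Standard source build
./autogen.sh && ./configure && make && sudo make install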


-b


 + Justin


  -b
 
  On Thu, Feb 5, 2015 at 6:04 PM, David F. Robinson
  david.robin...@corvidtec.com wrote:
  Isn't rsync what geo-rep uses?
 
  David (Sent from mobile)
 
  ===
  David F. Robinson, Ph.D.
  President - Corvid Technologies
  704.799.6944 x101 [office]
  704.252.1310 [cell]
  704.799.7974 [fax]
  david.robin...@corvidtec.com
  http://www.corvidtechnologies.com
 
   On Feb 5, 2015, at 5:41 PM, Ben Turner btur...@redhat.com 
wrote:

  
   - Original Message -
   From: Ben Turner btur...@redhat.com
   To: David F. Robinson david.robin...@corvidtec.com
   Cc: Pranith Kumar Karampuri pkara...@redhat.com, Xavier 
Hernandez

   xhernan...@datalab.es, Benjamin Turner
   bennytu...@gmail.com, gluster-us...@gluster.org, Gluster 
Devel

   gluster-devel@gluster.org
   Sent: Thursday, February 5, 2015 5:22:26 PM
   Subject: Re: [Gluster-users] [Gluster-devel] missing files
  
   - Original Message -
   From: David F. Robinson david.robin...@corvidtec.com
   To: Ben Turner btur...@redhat.com
   Cc: Pranith Kumar Karampuri pkara...@redhat.com, Xavier 
Hernandez

   xhernan...@datalab.es, Benjamin Turner
   bennytu...@gmail.com, gluster-us...@gluster.org, Gluster 
Devel

   gluster-devel@gluster.org
   Sent: Thursday, February 5, 2015 5:01:13 PM
   Subject: Re: [Gluster-users] [Gluster-devel] missing files
  
   I'll send you the emails I sent Pranith with the logs. What causes these
   disconnects?

   Thanks David! Disconnects happen when there are interruptions in
   communication between peers; normally there is a ping timeout that happens.
   It could be anything from a flaky NW to the system being too busy to
   respond to the pings. My initial take is more towards the latter, as rsync
   is absolutely the worst use case for gluster - IIRC it writes in 4kb
   blocks. I try to keep my writes at least 64KB as in my testing that is the
   smallest block size I can write with before perf starts to really drop
   off. I'll try something similar in the lab.

   Ok I do think that the file being self healed is RCA for what you were
   seeing. Lets look at one of the disconnects:

   data-brick02a-homegfs.log:[2015-02-03 20:54:02.772180] I
   [server.c:518:server_rpc_notify] 0-homegfs-server: disconnecting
   connection from
   gfs01b.corvidtec.com-4175-2015/02/02-16:44:31:179119-homegfs-client-2-0-1

   And in the glustershd.log from the gfs01b_glustershd.log file:

   [2015-02-03 20:55:48.001797] I
   [afr-self-heal-entry.c:554:afr_selfheal_entry_do] 0-homegfs-replicate-0:
   performing entry selfheal on 6c79a368-edaa-432b-bef9-ec690ab42448
   [2015-02-03 20:55:49.341996] I
   [afr-self-heal-common.c:476:afr_log_selfheal] 0-homegfs-replicate-0:
   Completed entry selfheal on 6c79a368-edaa-432b-bef9-ec690ab42448.
   source=1 sinks=0
   [2015-02-03 20:55:49.343093] I
   [afr-self-heal-entry.c:554:afr_selfheal_entry_do] 0-homegfs-replicate-0:
   performing entry selfheal on 792cb0d6-9290-4447-8cd7-2b2d7a116a69
   [2015-02-03 20:55:50.463652] I
   [afr-self-heal-common.c:476:afr_log_selfheal] 0-homegfs-replicate-0:
   Completed entry selfheal on 792cb0d6-9290-4447-8cd7-2b2d7a116a69.
   source=1 sinks=0
   [2015-02-03 20:55:51.465289] I
   [afr-self-heal-metadata.c:54:__afr_selfheal_metadata_do]
   0-homegfs-replicate-0: performing metadata

Re: [Gluster-devel] missing files

2015-02-05 Thread David F. Robinson
/Phase_1_SOCOM14-003_adv_armor/References:
total 0
drwxrws--- 2 root root 10 Feb 4 18:12 .
drwxrws--x 6 root root 95 Feb 4 18:12 ..

[root@gfs02a ~]# ls -alR 
/data/brick0*/homegfs/documentation/programs/OLD_PROGRAMS/SBIR_TOM/Phase_1_SOCOM14-003_adv_armor/References

/data/brick01a/homegfs/documentation/programs/OLD_PROGRAMS/SBIR_TOM/Phase_1_SOCOM14-003_adv_armor/References:
total 0
drwxrws--- 3 root root 41 Feb 4 18:12 .
drwxrws--x 7 root root 118 Feb 4 18:12 ..
drwxrws--- 2 streadway sbir 80 Jan 23 14:46 USSOCOM_OPAQUE_ARMOR

/data/brick01a/homegfs/documentation/programs/OLD_PROGRAMS/SBIR_TOM/Phase_1_SOCOM14-003_adv_armor/References/USSOCOM_OPAQUE_ARMOR:
total 72
drwxrws--- 2 streadway sbir 80 Jan 23 14:46 .
drwxrws--- 3 root root 41 Feb 4 18:12 ..
-rwxrw 2 streadway sbir 17248 Jun 19 2014 COMPARISON OF 
SOLUTIONS.one

-rwxrw 2 streadway sbir 49736 Jan 21 13:18 GIVEN TRADE SPACE.one

/data/brick02a/homegfs/documentation/programs/OLD_PROGRAMS/SBIR_TOM/Phase_1_SOCOM14-003_adv_armor/References:
total 0
drwxrws--- 3 root root 41 Feb 4 18:12 .
drwxrws--x 7 root root 118 Feb 4 18:12 ..
drwxrws--- 2 streadway sbir 79 Jan 23 14:46 USSOCOM_OPAQUE_ARMOR

/data/brick02a/homegfs/documentation/programs/OLD_PROGRAMS/SBIR_TOM/Phase_1_SOCOM14-003_adv_armor/References/USSOCOM_OPAQUE_ARMOR:
total 84
drwxrws--- 2 streadway sbir 79 Jan 23 14:46 .
drwxrws--- 3 root root 41 Feb 4 18:12 ..
-rwxrw 2 streadway sbir 42440 Jun 19 2014 ARMOR PACKAGES.one
-rwxrw 2 streadway sbir 38184 Jun 19 2014 CURRENT STANDARD 
ARMORING.one


[root@gfs02b ~]# ls -alR 
/data/brick0*/homegfs/documentation/programs/OLD_PROGRAMS/SBIR_TOM/Phase_1_SOCOM14-003_adv_armor/References

/data/brick01b/homegfs/documentation/programs/OLD_PROGRAMS/SBIR_TOM/Phase_1_SOCOM14-003_adv_armor/References:
total 0
drwxrws--- 3 root root 41 Feb 4 18:12 .
drwxrws--x 7 root root 118 Feb 4 18:12 ..
drwxrws--- 2 streadway sbir 80 Jan 23 14:46 USSOCOM_OPAQUE_ARMOR

/data/brick01b/homegfs/documentation/programs/OLD_PROGRAMS/SBIR_TOM/Phase_1_SOCOM14-003_adv_armor/References/USSOCOM_OPAQUE_ARMOR:
total 72
drwxrws--- 2 streadway sbir 80 Jan 23 14:46 .
drwxrws--- 3 root root 41 Feb 4 18:12 ..
-rwxrw 2 streadway sbir 17248 Jun 19 2014 COMPARISON OF 
SOLUTIONS.one

-rwxrw 2 streadway sbir 49736 Jan 21 13:18 GIVEN TRADE SPACE.one

/data/brick02b/homegfs/documentation/programs/OLD_PROGRAMS/SBIR_TOM/Phase_1_SOCOM14-003_adv_armor/References:
total 0
drwxrws--- 3 root root 41 Feb 4 18:12 .
drwxrws--x 7 root root 118 Feb 4 18:12 ..
drwxrws--- 2 streadway sbir 79 Jan 23 14:46 USSOCOM_OPAQUE_ARMOR

/data/brick02b/homegfs/documentation/programs/OLD_PROGRAMS/SBIR_TOM/Phase_1_SOCOM14-003_adv_armor/References/USSOCOM_OPAQUE_ARMOR:
total 84
drwxrws--- 2 streadway sbir 79 Jan 23 14:46 .
drwxrws--- 3 root root 41 Feb 4 18:12 ..
-rwxrw 2 streadway sbir 42440 Jun 19 2014 ARMOR PACKAGES.one
-rwxrw 2 streadway sbir 38184 Jun 19 2014 CURRENT STANDARD 
ARMORING.one






-- Original Message --
From: Xavier Hernandez xhernan...@datalab.es
To: David F. Robinson david.robin...@corvidtec.com; Benjamin 
Turner bennytu...@gmail.com; Pranith Kumar Karampuri 
pkara...@redhat.com
Cc: gluster-us...@gluster.org gluster-us...@gluster.org; Gluster 
Devel gluster-devel@gluster.org

Sent: 2/5/2015 5:14:22 AM
Subject: Re: [Gluster-devel] missing files


Is the failure repeatable? With the same directories?

It's very weird that the directories appear on the volume when you do 
an 'ls' on the bricks. Could it be that you only did a single 'ls' on 
the fuse mount, which did not show the directory? Is it possible that 
this 'ls' triggered a self-heal that repaired the problem, whatever it 
was, and when you did another 'ls' on the fuse mount after the 'ls' on 
the bricks, the directories were there?


The first 'ls' could have healed the files, causing the following 'ls' 
on the bricks to show the files as if nothing were damaged. If that's 
the case, it's possible that there were some disconnections during the 
copy.


Added Pranith because he knows better replication and self-heal 
details.


Xavi

On 02/04/2015 07:23 PM, David F. Robinson wrote:

Distributed/replicated

Volume Name: homegfs
Type: Distributed-Replicate
Volume ID: 1e32672a-f1b7-4b58-ba94-58c085e59071
Status: Started
Number of Bricks: 4 x 2 = 8
Transport-type: tcp
Bricks:
Brick1: gfsib01a.corvidtec.com:/data/brick01a/homegfs
Brick2: gfsib01b.corvidtec.com:/data/brick01b/homegfs
Brick3: gfsib01a.corvidtec.com:/data/brick02a/homegfs
Brick4: gfsib01b.corvidtec.com:/data/brick02b/homegfs
Brick5: gfsib02a.corvidtec.com:/data/brick01a/homegfs
Brick6: gfsib02b.corvidtec.com:/data/brick01b/homegfs
Brick7: gfsib02a.corvidtec.com:/data/brick02a/homegfs
Brick8: gfsib02b.corvidtec.com:/data/brick02b/homegfs
Options Reconfigured:
performance.io-thread-count: 32
performance.cache-size: 128MB
performance.write-behind-window-size: 128MB
server.allow-insecure: on
network.ping-timeout: 10
storage.owner-gid: 100
geo-replication.indexing: off
geo

Re: [Gluster-devel] [Gluster-users] missing files

2015-02-05 Thread David F. Robinson
It was a mix of files from very small to very large. And many terabytes of 
data. Approx 20tb

David  (Sent from mobile)

===
David F. Robinson, Ph.D. 
President - Corvid Technologies
704.799.6944 x101 [office]
704.252.1310  [cell]
704.799.7974  [fax]
david.robin...@corvidtec.com
http://www.corvidtechnologies.com

 On Feb 5, 2015, at 4:55 PM, Ben Turner btur...@redhat.com wrote:
 
 - Original Message -
 From: Pranith Kumar Karampuri pkara...@redhat.com
 To: Xavier Hernandez xhernan...@datalab.es, David F. Robinson 
 david.robin...@corvidtec.com, Benjamin Turner
 bennytu...@gmail.com
 Cc: gluster-us...@gluster.org, Gluster Devel gluster-devel@gluster.org
 Sent: Thursday, February 5, 2015 5:30:04 AM
 Subject: Re: [Gluster-users] [Gluster-devel] missing files
 
 
 On 02/05/2015 03:48 PM, Pranith Kumar Karampuri wrote:
 I believe David already fixed this. I hope this is the same permissions
 issue he told us about.
 Oops, it is not. I will take a look.
 
 Yes David exactly like these:
 
 data-brick02a-homegfs.log:[2015-02-03 19:09:34.568842] I 
 [server.c:518:server_rpc_notify] 0-homegfs-server: disconnecting connection 
 from 
 gfs02a.corvidtec.com-18563-2015/02/03-19:07:58:519134-homegfs-client-2-0-0
 data-brick02a-homegfs.log:[2015-02-03 19:09:41.286551] I 
 [server.c:518:server_rpc_notify] 0-homegfs-server: disconnecting connection 
 from 
 gfs01a.corvidtec.com-12804-2015/02/03-19:09:38:497808-homegfs-client-2-0-0
 data-brick02a-homegfs.log:[2015-02-03 19:16:35.906412] I 
 [server.c:518:server_rpc_notify] 0-homegfs-server: disconnecting connection 
 from 
 gfs02b.corvidtec.com-27190-2015/02/03-19:15:53:458467-homegfs-client-2-0-0
 data-brick02a-homegfs.log:[2015-02-03 19:51:22.761293] I 
 [server.c:518:server_rpc_notify] 0-homegfs-server: disconnecting connection 
 from gfs01a.corvidtec.com-25926-2015/02/03-19:51:02:89070-homegfs-client-2-0-0
 data-brick02a-homegfs.log:[2015-02-03 20:54:02.772180] I 
 [server.c:518:server_rpc_notify] 0-homegfs-server: disconnecting connection 
 from gfs01b.corvidtec.com-4175-2015/02/02-16:44:31:179119-homegfs-client-2-0-1
 
 You can 100% verify my theory if you can correlate the time on the 
 disconnects to the time that the missing files were healed.  Can you have a 
 look at /var/log/glusterfs/glustershd.log?  That has all of the healed files 
 + timestamps, if we can see a disconnect during the rsync and a self heal of 
 the missing file I think we can safely assume that the disconnects may have 
 caused this.  I'll try this on my test systems, how much data did you rsync?  
 What size ish of files / an idea of the dir layout?  
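A sketch of that correlation with grep, using the log names already quoted 
in this thread (the brick log path is assumed to be under 
/var/log/glusterfs/bricks/):

#... Disconnects as seen by the brick
grep "disconnecting connection" /var/log/glusterfs/bricks/data-brick02a-homegfs.log
#... Self-heals performed around the same timestamps
grep -E "performing (entry|metadata) selfheal|Completed (entry|metadata) selfheal" /var/log/glusterfs/glustershd.log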
 
 @Pranith - Could bricks flapping up and down during the rsync be a possible 
 cause here: the files were missing on the first ls (written to 1 subvol but 
 not the other because it was down), the ls triggered SH, and that's why the 
 files were there for the second ls?
 
 -b
 
 
 Pranith
 
 Pranith
 On 02/05/2015 03:44 PM, Xavier Hernandez wrote:
 Is the failure repeatable ? with the same directories ?
 
 It's very weird that the directories appear on the volume when you do
 an 'ls' on the bricks. Could it be that you only made a single 'ls'
 on fuse mount which not showed the directory ? Is it possible that
 this 'ls' triggered a self-heal that repaired the problem, whatever
 it was, and when you did another 'ls' on the fuse mount after the
 'ls' on the bricks, the directories were there ?
 
 The first 'ls' could have healed the files, causing that the
 following 'ls' on the bricks showed the files as if nothing were
 damaged. If that's the case, it's possible that there were some
 disconnections during the copy.
 
 Added Pranith because he knows better replication and self-heal details.
 
 Xavi
 
 On 02/04/2015 07:23 PM, David F. Robinson wrote:
 Distributed/replicated
 
 Volume Name: homegfs
 Type: Distributed-Replicate
 Volume ID: 1e32672a-f1b7-4b58-ba94-58c085e59071
 Status: Started
 Number of Bricks: 4 x 2 = 8
 Transport-type: tcp
 Bricks:
 Brick1: gfsib01a.corvidtec.com:/data/brick01a/homegfs
 Brick2: gfsib01b.corvidtec.com:/data/brick01b/homegfs
 Brick3: gfsib01a.corvidtec.com:/data/brick02a/homegfs
 Brick4: gfsib01b.corvidtec.com:/data/brick02b/homegfs
 Brick5: gfsib02a.corvidtec.com:/data/brick01a/homegfs
 Brick6: gfsib02b.corvidtec.com:/data/brick01b/homegfs
 Brick7: gfsib02a.corvidtec.com:/data/brick02a/homegfs
 Brick8: gfsib02b.corvidtec.com:/data/brick02b/homegfs
 Options Reconfigured:
 performance.io-thread-count: 32
 performance.cache-size: 128MB
 performance.write-behind-window-size: 128MB
 server.allow-insecure: on
 network.ping-timeout: 10
 storage.owner-gid: 100
 geo-replication.indexing: off
 geo-replication.ignore-pid-check: on
 changelog.changelog: on
 changelog.fsync-interval: 3
 changelog.rollover-time: 15
 server.manage-gids: on
 
 
 -- Original Message --
 From: Xavier Hernandez xhernan...@datalab.es
 To: David F. Robinson david.robin...@corvidtec.com

Re: [Gluster-devel] [Gluster-users] missing files

2015-02-05 Thread David F. Robinson
Should I run my rsync with --block-size set to something other than the 
default? Is there an optimal value? I think 128k is the max from my quick 
search. Didn't dig into it thoroughly though.

David  (Sent from mobile)

===
David F. Robinson, Ph.D. 
President - Corvid Technologies
704.799.6944 x101 [office]
704.252.1310  [cell]
704.799.7974  [fax]
david.robin...@corvidtec.com
http://www.corvidtechnologies.com

 On Feb 5, 2015, at 5:41 PM, Ben Turner btur...@redhat.com wrote:
 
 - Original Message -
 From: Ben Turner btur...@redhat.com
 To: David F. Robinson david.robin...@corvidtec.com
 Cc: Pranith Kumar Karampuri pkara...@redhat.com, Xavier Hernandez 
 xhernan...@datalab.es, Benjamin Turner
 bennytu...@gmail.com, gluster-us...@gluster.org, Gluster Devel 
 gluster-devel@gluster.org
 Sent: Thursday, February 5, 2015 5:22:26 PM
 Subject: Re: [Gluster-users] [Gluster-devel] missing files
 
 - Original Message -
 From: David F. Robinson david.robin...@corvidtec.com
 To: Ben Turner btur...@redhat.com
 Cc: Pranith Kumar Karampuri pkara...@redhat.com, Xavier Hernandez
 xhernan...@datalab.es, Benjamin Turner
 bennytu...@gmail.com, gluster-us...@gluster.org, Gluster Devel
 gluster-devel@gluster.org
 Sent: Thursday, February 5, 2015 5:01:13 PM
 Subject: Re: [Gluster-users] [Gluster-devel] missing files
 
 I'll send you the emails I sent Pranith with the logs. What causes these
 disconnects?
 
 Thanks David!  Disconnects happen when there are interruptions in
 communication between peers; normally there is a ping timeout that happens.
 It could be anything from a flaky NW to the system being too busy to respond
 to the pings.  My initial take is more towards the latter, as rsync is
 absolutely the worst use case for gluster - IIRC it writes in 4kb blocks.  I
 try to keep my writes at least 64KB as in my testing that is the smallest
 block size I can write with before perf starts to really drop off.  I'll try
 something similar in the lab.
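As a rough illustration of the block-size effect Ben describes, one could 
compare small and large writes on the FUSE mount with dd (the mount path and 
sizes here are made up):

#... ~1 GB written in 4 KB blocks, similar to rsync's write size
dd if=/dev/zero of=/homegfs/ddtest bs=4k count=262144 conv=fsync
#... The same amount in 64 KB blocks for comparison
dd if=/dev/zero of=/homegfs/ddtest bs=64k count=16384 conv=fsync
rm -f /homegfs/ddtest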
 
 Ok I do think that the file being self healed is RCA for what you were 
 seeing.  Lets look at one of the disconnects:
 
 data-brick02a-homegfs.log:[2015-02-03 20:54:02.772180] I 
 [server.c:518:server_rpc_notify] 0-homegfs-server: disconnecting connection 
 from gfs01b.corvidtec.com-4175-2015/02/02-16:44:31:179119-homegfs-client-2-0-1
 
 And in the glustershd.log from the gfs01b_glustershd.log file:
 
 [2015-02-03 20:55:48.001797] I 
 [afr-self-heal-entry.c:554:afr_selfheal_entry_do] 0-homegfs-replicate-0: 
 performing entry selfheal on 6c79a368-edaa-432b-bef9-ec690ab42448
 [2015-02-03 20:55:49.341996] I [afr-self-heal-common.c:476:afr_log_selfheal] 
 0-homegfs-replicate-0: Completed entry selfheal on 
 6c79a368-edaa-432b-bef9-ec690ab42448. source=1 sinks=0 
 [2015-02-03 20:55:49.343093] I 
 [afr-self-heal-entry.c:554:afr_selfheal_entry_do] 0-homegfs-replicate-0: 
 performing entry selfheal on 792cb0d6-9290-4447-8cd7-2b2d7a116a69
 [2015-02-03 20:55:50.463652] I [afr-self-heal-common.c:476:afr_log_selfheal] 
 0-homegfs-replicate-0: Completed entry selfheal on 
 792cb0d6-9290-4447-8cd7-2b2d7a116a69. source=1 sinks=0 
 [2015-02-03 20:55:51.465289] I 
 [afr-self-heal-metadata.c:54:__afr_selfheal_metadata_do] 
 0-homegfs-replicate-0: performing metadata selfheal on 
 403e661a-1c27-4e79-9867-c0572aba2b3c
 [2015-02-03 20:55:51.466515] I [afr-self-heal-common.c:476:afr_log_selfheal] 
 0-homegfs-replicate-0: Completed metadata selfheal on 
 403e661a-1c27-4e79-9867-c0572aba2b3c. source=1 sinks=0 
 [2015-02-03 20:55:51.467098] I 
 [afr-self-heal-entry.c:554:afr_selfheal_entry_do] 0-homegfs-replicate-0: 
 performing entry selfheal on 403e661a-1c27-4e79-9867-c0572aba2b3c
 [2015-02-03 20:55:55.257808] I [afr-self-heal-common.c:476:afr_log_selfheal] 
 0-homegfs-replicate-0: Completed entry selfheal on 
 403e661a-1c27-4e79-9867-c0572aba2b3c. source=1 sinks=0 
 [2015-02-03 20:55:55.258548] I 
 [afr-self-heal-metadata.c:54:__afr_selfheal_metadata_do] 
 0-homegfs-replicate-0: performing metadata selfheal on 
 c612ee2f-2fb4-4157-a9ab-5a2d5603c541
 [2015-02-03 20:55:55.259367] I [afr-self-heal-common.c:476:afr_log_selfheal] 
 0-homegfs-replicate-0: Completed metadata selfheal on 
 c612ee2f-2fb4-4157-a9ab-5a2d5603c541. source=1 sinks=0 
 [2015-02-03 20:55:55.259980] I 
 [afr-self-heal-entry.c:554:afr_selfheal_entry_do] 0-homegfs-replicate-0: 
 performing entry selfheal on c612ee2f-2fb4-4157-a9ab-5a2d5603c541
 
 As you can see, the self-heal logs are just spammed with files being healed, 
 and for a couple of the disconnects I looked at I see self-heals getting run 
 shortly after on the bricks that were down.  Now we need to find the cause of 
 the disconnects; I am thinking that once the disconnects are resolved the files 
 should be properly copied over without SH having to fix things.  Like I said, 
 I'll give this a go on my lab systems and see if I can repro the disconnects; 
 I'll have time to run through it tomorrow.  If in the meantime anyone else 
 has a theory / anything to add here it would be appreciated.
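
 To make that correlation concrete, here is a rough sketch of the check being 
 described (assuming the default /var/log/glusterfs/ log locations; the brick 
 log name is just the one quoted above):

 # Timestamps of client disconnects seen by a brick:
 grep 'disconnecting connection' /var/log/glusterfs/bricks/data-brick02a-homegfs.log | awk '{print $1, $2}'

 # Timestamps of completed self-heals on the same node:
 grep 'Completed .* selfheal on' /var/log/glusterfs/glustershd.log | awk '{print $1, $2}'

 If heal timestamps consistently land shortly after disconnect timestamps 
 during the window of the rsync, that supports the theory above.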

Re: [Gluster-devel] [Gluster-users] missing files

2015-02-05 Thread David F. Robinson
I'll send you the emails I sent Pranith with the logs. What causes these 
disconnects?

David  (Sent from mobile)

===
David F. Robinson, Ph.D. 
President - Corvid Technologies
704.799.6944 x101 [office]
704.252.1310  [cell]
704.799.7974  [fax]
david.robin...@corvidtec.com
http://www.corvidtechnologies.com

 On Feb 5, 2015, at 4:55 PM, Ben Turner btur...@redhat.com wrote:
 
 - Original Message -
 From: Pranith Kumar Karampuri pkara...@redhat.com
 To: Xavier Hernandez xhernan...@datalab.es, David F. Robinson 
 david.robin...@corvidtec.com, Benjamin Turner
 bennytu...@gmail.com
 Cc: gluster-us...@gluster.org, Gluster Devel gluster-devel@gluster.org
 Sent: Thursday, February 5, 2015 5:30:04 AM
 Subject: Re: [Gluster-users] [Gluster-devel] missing files
 
 
 On 02/05/2015 03:48 PM, Pranith Kumar Karampuri wrote:
 I believe David already fixed this. I hope this is the same issue he
 told about permissions issue.
 Oops, it is not. I will take a look.
 
 Yes David exactly like these:
 
 data-brick02a-homegfs.log:[2015-02-03 19:09:34.568842] I 
 [server.c:518:server_rpc_notify] 0-homegfs-server: disconnecting connection 
 from 
 gfs02a.corvidtec.com-18563-2015/02/03-19:07:58:519134-homegfs-client-2-0-0
 data-brick02a-homegfs.log:[2015-02-03 19:09:41.286551] I 
 [server.c:518:server_rpc_notify] 0-homegfs-server: disconnecting connection 
 from 
 gfs01a.corvidtec.com-12804-2015/02/03-19:09:38:497808-homegfs-client-2-0-0
 data-brick02a-homegfs.log:[2015-02-03 19:16:35.906412] I 
 [server.c:518:server_rpc_notify] 0-homegfs-server: disconnecting connection 
 from 
 gfs02b.corvidtec.com-27190-2015/02/03-19:15:53:458467-homegfs-client-2-0-0
 data-brick02a-homegfs.log:[2015-02-03 19:51:22.761293] I 
 [server.c:518:server_rpc_notify] 0-homegfs-server: disconnecting connection 
 from gfs01a.corvidtec.com-25926-2015/02/03-19:51:02:89070-homegfs-client-2-0-0
 data-brick02a-homegfs.log:[2015-02-03 20:54:02.772180] I 
 [server.c:518:server_rpc_notify] 0-homegfs-server: disconnecting connection 
 from gfs01b.corvidtec.com-4175-2015/02/02-16:44:31:179119-homegfs-client-2-0-1
 
 You can 100% verify my theory if you can correlate the time of the 
 disconnects to the time that the missing files were healed.  Can you have a 
 look at /var/log/glusterfs/glustershd.log?  That has all of the healed files 
 + timestamps; if we can see a disconnect during the rsync and a self-heal of 
 the missing file, I think we can safely assume that the disconnects may have 
 caused this.  I'll try this on my test systems - how much data did you rsync?  
 What size-ish of files / an idea of the dir layout?  
 
 @Pranith - Could bricks flapping up and down during the rsync be a possible 
 cause here: the files were missing on the first ls (written to 1 subvol but 
 not the other because it was down), the ls triggered SH, and that's why the 
 files were there for the second ls?
 
 -b
 
 
 Pranith
 
 Pranith
 On 02/05/2015 03:44 PM, Xavier Hernandez wrote:
 Is the failure repeatable ? with the same directories ?
 
 It's very weird that the directories appear on the volume when you do
 an 'ls' on the bricks. Could it be that you only made a single 'ls'
 on fuse mount which not showed the directory ? Is it possible that
 this 'ls' triggered a self-heal that repaired the problem, whatever
 it was, and when you did another 'ls' on the fuse mount after the
 'ls' on the bricks, the directories were there ?
 
 The first 'ls' could have healed the files, causing that the
 following 'ls' on the bricks showed the files as if nothing were
 damaged. If that's the case, it's possible that there were some
 disconnections during the copy.
 
 Added Pranith because he knows better replication and self-heal details.
 
 Xavi
 
 On 02/04/2015 07:23 PM, David F. Robinson wrote:
 Distributed/replicated
 
 Volume Name: homegfs
 Type: Distributed-Replicate
 Volume ID: 1e32672a-f1b7-4b58-ba94-58c085e59071
 Status: Started
 Number of Bricks: 4 x 2 = 8
 Transport-type: tcp
 Bricks:
 Brick1: gfsib01a.corvidtec.com:/data/brick01a/homegfs
 Brick2: gfsib01b.corvidtec.com:/data/brick01b/homegfs
 Brick3: gfsib01a.corvidtec.com:/data/brick02a/homegfs
 Brick4: gfsib01b.corvidtec.com:/data/brick02b/homegfs
 Brick5: gfsib02a.corvidtec.com:/data/brick01a/homegfs
 Brick6: gfsib02b.corvidtec.com:/data/brick01b/homegfs
 Brick7: gfsib02a.corvidtec.com:/data/brick02a/homegfs
 Brick8: gfsib02b.corvidtec.com:/data/brick02b/homegfs
 Options Reconfigured:
 performance.io-thread-count: 32
 performance.cache-size: 128MB
 performance.write-behind-window-size: 128MB
 server.allow-insecure: on
 network.ping-timeout: 10
 storage.owner-gid: 100
 geo-replication.indexing: off
 geo-replication.ignore-pid-check: on
 changelog.changelog: on
 changelog.fsync-interval: 3
 changelog.rollover-time: 15
 server.manage-gids: on
 
 
 -- Original Message --
 From: Xavier Hernandez xhernan...@datalab.es
 To: David F. Robinson david.robin...@corvidtec.com; Benjamin

Re: [Gluster-devel] [Gluster-users] missing files

2015-02-05 Thread David F. Robinson
Isn't rsync what geo-rep uses?

David  (Sent from mobile)

===
David F. Robinson, Ph.D. 
President - Corvid Technologies
704.799.6944 x101 [office]
704.252.1310  [cell]
704.799.7974  [fax]
david.robin...@corvidtec.com
http://www.corvidtechnologies.com

 On Feb 5, 2015, at 5:41 PM, Ben Turner btur...@redhat.com wrote:
 
 - Original Message -
 From: Ben Turner btur...@redhat.com
 To: David F. Robinson david.robin...@corvidtec.com
 Cc: Pranith Kumar Karampuri pkara...@redhat.com, Xavier Hernandez 
 xhernan...@datalab.es, Benjamin Turner
 bennytu...@gmail.com, gluster-us...@gluster.org, Gluster Devel 
 gluster-devel@gluster.org
 Sent: Thursday, February 5, 2015 5:22:26 PM
 Subject: Re: [Gluster-users] [Gluster-devel] missing files
 
 - Original Message -
 From: David F. Robinson david.robin...@corvidtec.com
 To: Ben Turner btur...@redhat.com
 Cc: Pranith Kumar Karampuri pkara...@redhat.com, Xavier Hernandez
 xhernan...@datalab.es, Benjamin Turner
 bennytu...@gmail.com, gluster-us...@gluster.org, Gluster Devel
 gluster-devel@gluster.org
 Sent: Thursday, February 5, 2015 5:01:13 PM
 Subject: Re: [Gluster-users] [Gluster-devel] missing files
 
 I'll send you the emails I sent Pranith with the logs. What causes these
 disconnects?
 
 Thanks David!  Disconnects happen when there are interruptions in
 communication between peers; normally that means a ping timeout has occurred.
 It could be anything from a flaky NW to the system being too busy to respond
 to the pings.  My initial take leans more towards the latter, as rsync is
 absolutely the worst use case for gluster - IIRC it writes in 4kb blocks.  I
 try to keep my writes at least 64KB, as in my testing that is the smallest
 block size I can write with before perf starts to really drop off.  I'll try
 something similar in the lab.
 
 OK, I do think that the files being self-healed is the RCA for what you were 
 seeing.  Let's look at one of the disconnects:
 
 data-brick02a-homegfs.log:[2015-02-03 20:54:02.772180] I 
 [server.c:518:server_rpc_notify] 0-homegfs-server: disconnecting connection 
 from gfs01b.corvidtec.com-4175-2015/02/02-16:44:31:179119-homegfs-client-2-0-1
 
 And in the glustershd.log from the gfs01b_glustershd.log file:
 
 [2015-02-03 20:55:48.001797] I 
 [afr-self-heal-entry.c:554:afr_selfheal_entry_do] 0-homegfs-replicate-0: 
 performing entry selfheal on 6c79a368-edaa-432b-bef9-ec690ab42448
 [2015-02-03 20:55:49.341996] I [afr-self-heal-common.c:476:afr_log_selfheal] 
 0-homegfs-replicate-0: Completed entry selfheal on 
 6c79a368-edaa-432b-bef9-ec690ab42448. source=1 sinks=0 
 [2015-02-03 20:55:49.343093] I 
 [afr-self-heal-entry.c:554:afr_selfheal_entry_do] 0-homegfs-replicate-0: 
 performing entry selfheal on 792cb0d6-9290-4447-8cd7-2b2d7a116a69
 [2015-02-03 20:55:50.463652] I [afr-self-heal-common.c:476:afr_log_selfheal] 
 0-homegfs-replicate-0: Completed entry selfheal on 
 792cb0d6-9290-4447-8cd7-2b2d7a116a69. source=1 sinks=0 
 [2015-02-03 20:55:51.465289] I 
 [afr-self-heal-metadata.c:54:__afr_selfheal_metadata_do] 
 0-homegfs-replicate-0: performing metadata selfheal on 
 403e661a-1c27-4e79-9867-c0572aba2b3c
 [2015-02-03 20:55:51.466515] I [afr-self-heal-common.c:476:afr_log_selfheal] 
 0-homegfs-replicate-0: Completed metadata selfheal on 
 403e661a-1c27-4e79-9867-c0572aba2b3c. source=1 sinks=0 
 [2015-02-03 20:55:51.467098] I 
 [afr-self-heal-entry.c:554:afr_selfheal_entry_do] 0-homegfs-replicate-0: 
 performing entry selfheal on 403e661a-1c27-4e79-9867-c0572aba2b3c
 [2015-02-03 20:55:55.257808] I [afr-self-heal-common.c:476:afr_log_selfheal] 
 0-homegfs-replicate-0: Completed entry selfheal on 
 403e661a-1c27-4e79-9867-c0572aba2b3c. source=1 sinks=0 
 [2015-02-03 20:55:55.258548] I 
 [afr-self-heal-metadata.c:54:__afr_selfheal_metadata_do] 
 0-homegfs-replicate-0: performing metadata selfheal on 
 c612ee2f-2fb4-4157-a9ab-5a2d5603c541
 [2015-02-03 20:55:55.259367] I [afr-self-heal-common.c:476:afr_log_selfheal] 
 0-homegfs-replicate-0: Completed metadata selfheal on 
 c612ee2f-2fb4-4157-a9ab-5a2d5603c541. source=1 sinks=0 
 [2015-02-03 20:55:55.259980] I 
 [afr-self-heal-entry.c:554:afr_selfheal_entry_do] 0-homegfs-replicate-0: 
 performing entry selfheal on c612ee2f-2fb4-4157-a9ab-5a2d5603c541
 
 As you can see, the self-heal logs are just spammed with files being healed, 
 and for a couple of the disconnects I looked at I see self-heals getting run 
 shortly after on the bricks that were down.  Now we need to find the cause of 
 the disconnects; I am thinking that once the disconnects are resolved the files 
 should be properly copied over without SH having to fix things.  Like I said, 
 I'll give this a go on my lab systems and see if I can repro the disconnects; 
 I'll have time to run through it tomorrow.  If in the meantime anyone else 
 has a theory / anything to add here it would be appreciated.
 
 -b
 
 -b
 
 David  (Sent from mobile)
 
 ===
 David F. Robinson, Ph.D

Re: [Gluster-devel] [Gluster-users] missing files

2015-02-05 Thread David F. Robinson

copy that.  Thanks for looking into the issue.

David


-- Original Message --
From: Benjamin Turner bennytu...@gmail.com
To: David F. Robinson david.robin...@corvidtec.com
Cc: Ben Turner btur...@redhat.com; Pranith Kumar Karampuri 
pkara...@redhat.com; Xavier Hernandez xhernan...@datalab.es; 
gluster-us...@gluster.org gluster-us...@gluster.org; Gluster Devel 
gluster-devel@gluster.org

Sent: 2/5/2015 9:05:43 PM
Subject: Re: [Gluster-users] [Gluster-devel] missing files

Correct!  I have seen (back in the day, it's been 3-ish years since I have 
seen it) having say 50+ volumes, each with a geo-rep session, take system 
load levels to the point where pings couldn't be serviced within the 
ping timeout.  So it is known to happen, but there has been a lot of work 
in the geo-rep space to help here, some of which is discussed:


https://medium.com/@msvbhat/distributed-geo-replication-in-glusterfs-ec95f4393c50

(think tar + ssh and other fixes).  Your symptoms remind me of that case 
of 50+ geo-rep'd volumes, that's why I mentioned it from the start.  My 
current shoot-from-the-hip theory is that when rsyncing all that data the 
servers got too busy to service the pings and it led to disconnects.  
This is common across all of the clustering / distributed software I 
have worked on: if the system gets too busy to service heartbeat within 
the timeout, things go crazy (think fork bomb on a single host).  Now 
this could be a case of me putting symptoms from an old issue into what 
you are describing, but that's where my head is at.  If I'm correct I 
should be able to repro using a similar workload.  I think that the 
multi-threaded epoll changes that _just_ landed in master will help 
resolve this, but they are so new I haven't been able to test this.  
I'll know more when I get a chance to test tomorrow.


-b
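
(As a hedged aside on the two knobs mentioned above: the volume in this 
thread runs network.ping-timeout: 10, well below the 42-second default, and 
the event-thread options only exist on builds that carry the multi-threaded 
epoll work, so treat the names below as illustrative and check 
'gluster volume set help' on your version:)

# Give busy servers more time before a missed ping turns into a disconnect:
gluster volume set homegfs network.ping-timeout 42

# On builds with multi-threaded epoll, spread RPC handling over more threads:
gluster volume set homegfs client.event-threads 4
gluster volume set homegfs server.event-threads 4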

On Thu, Feb 5, 2015 at 6:04 PM, David F. Robinson 
david.robin...@corvidtec.com wrote:

Isn't rsync what geo-rep uses?

David  (Sent from mobile)

===
David F. Robinson, Ph.D.
President - Corvid Technologies
704.799.6944 x101 [office]
704.252.1310  [cell]
704.799.7974  [fax]
david.robin...@corvidtec.com
http://www.corvidtechnologies.com

 On Feb 5, 2015, at 5:41 PM, Ben Turner btur...@redhat.com wrote:

 - Original Message -
 From: Ben Turner btur...@redhat.com
 To: David F. Robinson david.robin...@corvidtec.com
 Cc: Pranith Kumar Karampuri pkara...@redhat.com, Xavier 
Hernandez xhernan...@datalab.es, Benjamin Turner
 bennytu...@gmail.com, gluster-us...@gluster.org, Gluster Devel 
gluster-devel@gluster.org

 Sent: Thursday, February 5, 2015 5:22:26 PM
 Subject: Re: [Gluster-users] [Gluster-devel] missing files

 - Original Message -
 From: David F. Robinson david.robin...@corvidtec.com
 To: Ben Turner btur...@redhat.com
 Cc: Pranith Kumar Karampuri pkara...@redhat.com, Xavier 
Hernandez

 xhernan...@datalab.es, Benjamin Turner
 bennytu...@gmail.com, gluster-us...@gluster.org, Gluster Devel
 gluster-devel@gluster.org
 Sent: Thursday, February 5, 2015 5:01:13 PM
 Subject: Re: [Gluster-users] [Gluster-devel] missing files

 I'll send you the emails I sent Pranith with the logs. What causes 
these

 disconnects?

 Thanks David!  Disconnects happen when there are interruptions in
 communication between peers; normally that means a ping timeout has occurred.
 It could be anything from a flaky NW to the system being too busy to respond
 to the pings.  My initial take leans more towards the latter, as rsync is
 absolutely the worst use case for gluster - IIRC it writes in 4kb blocks.  I
 try to keep my writes at least 64KB, as in my testing that is the smallest
 block size I can write with before perf starts to really drop off.  I'll try
 something similar in the lab.

 OK, I do think that the files being self-healed is the RCA for what you 
were seeing.  Let's look at one of the disconnects:


 data-brick02a-homegfs.log:[2015-02-03 20:54:02.772180] I 
[server.c:518:server_rpc_notify] 0-homegfs-server: disconnecting 
connection from 
gfs01b.corvidtec.com-4175-2015/02/02-16:44:31:179119-homegfs-client-2-0-1


 And in the glustershd.log from the gfs01b_glustershd.log file:

 [2015-02-03 20:55:48.001797] I 
[afr-self-heal-entry.c:554:afr_selfheal_entry_do] 
0-homegfs-replicate-0: performing entry selfheal on 
6c79a368-edaa-432b-bef9-ec690ab42448
 [2015-02-03 20:55:49.341996] I 
[afr-self-heal-common.c:476:afr_log_selfheal] 0-homegfs-replicate-0: 
Completed entry selfheal on 6c79a368-edaa-432b-bef9-ec690ab42448. 
source=1 sinks=0
 [2015-02-03 20:55:49.343093] I 
[afr-self-heal-entry.c:554:afr_selfheal_entry_do] 
0-homegfs-replicate-0: performing entry selfheal on 
792cb0d6-9290-4447-8cd7-2b2d7a116a69
 [2015-02-03 20:55:50.463652] I 
[afr-self-heal-common.c:476:afr_log_selfheal] 0-homegfs-replicate-0: 
Completed entry selfheal on 792cb0d6-9290-4447-8cd7-2b2d7a116a69. 
source=1 sinks=0
 [2015-02-03 20:55:51.465289] I 
[afr-self-heal-metadata.c:54:__afr_selfheal_metadata_do] 
0-homegfs

Re: [Gluster-devel] missing files

2015-02-04 Thread David F. Robinson

Distributed/replicated

Volume Name: homegfs
Type: Distributed-Replicate
Volume ID: 1e32672a-f1b7-4b58-ba94-58c085e59071
Status: Started
Number of Bricks: 4 x 2 = 8
Transport-type: tcp
Bricks:
Brick1: gfsib01a.corvidtec.com:/data/brick01a/homegfs
Brick2: gfsib01b.corvidtec.com:/data/brick01b/homegfs
Brick3: gfsib01a.corvidtec.com:/data/brick02a/homegfs
Brick4: gfsib01b.corvidtec.com:/data/brick02b/homegfs
Brick5: gfsib02a.corvidtec.com:/data/brick01a/homegfs
Brick6: gfsib02b.corvidtec.com:/data/brick01b/homegfs
Brick7: gfsib02a.corvidtec.com:/data/brick02a/homegfs
Brick8: gfsib02b.corvidtec.com:/data/brick02b/homegfs
Options Reconfigured:
performance.io-thread-count: 32
performance.cache-size: 128MB
performance.write-behind-window-size: 128MB
server.allow-insecure: on
network.ping-timeout: 10
storage.owner-gid: 100
geo-replication.indexing: off
geo-replication.ignore-pid-check: on
changelog.changelog: on
changelog.fsync-interval: 3
changelog.rollover-time: 15
server.manage-gids: on


-- Original Message --
From: Xavier Hernandez xhernan...@datalab.es
To: David F. Robinson david.robin...@corvidtec.com; Benjamin 
Turner bennytu...@gmail.com
Cc: gluster-us...@gluster.org gluster-us...@gluster.org; Gluster 
Devel gluster-devel@gluster.org

Sent: 2/4/2015 6:03:45 AM
Subject: Re: [Gluster-devel] missing files


On 02/04/2015 01:30 AM, David F. Robinson wrote:

Sorry. Thought about this a little more. I should have been clearer.
The files were on both bricks of the replica, not just one side. So,
both bricks had to have been up... The files/directories just don't 
show

up on the mount.
I was reading and saw a related bug
(https://bugzilla.redhat.com/show_bug.cgi?id=1159484). I saw it
suggested to run:
 find <mount> -d -exec getfattr -h -n trusted.ec.heal {} \;


This command is specific for a dispersed volume. It won't do anything 
(aside from the error you are seeing) on a replicated volume.


I think you are using a replicated volume, right ?

In this case I'm not sure what can be happening. Is your volume a pure 
replicated one or a distributed-replicated ? on a pure replicated it 
doesn't make sense that some entries do not show in an 'ls' when the 
file is in both replicas (at least without any error message in the 
logs). On a distributed-replicated it could be caused by some problem 
while combining contents of each replica set.


What's the configuration of your volume ?

Xavi
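
(A hedged aside for the replicate case: the rough counterpart of inspecting 
trusted.ec.heal on a dispersed volume is to look at the AFR changelog xattrs 
directly on each brick's copy of a suspect file, e.g.:)

# Run on the brick servers against the brick path, not the FUSE mount;
# the path below is only an example based on this volume's bricks.
getfattr -d -m trusted.afr -e hex /data/brick01a/homegfs/path/to/suspect/file

Non-zero trusted.afr.* counters on one copy generally mean that copy has 
pending changes the other replica still needs to heal.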



I get a bunch of errors for operation not supported:
[root@gfs02a homegfs]# find wks_backup -d -exec getfattr -h -n
trusted.ec.heal {} \;
find: warning: the -d option is deprecated; please use -depth instead,
because the latter is a POSIX-compliant feature.
wks_backup/homer_backup/backup: trusted.ec.heal: Operation not 
supported
wks_backup/homer_backup/logs/2014_05_20.log: trusted.ec.heal: 
Operation

not supported
wks_backup/homer_backup/logs/2014_05_21.log: trusted.ec.heal: 
Operation

not supported
wks_backup/homer_backup/logs/2014_05_18.log: trusted.ec.heal: 
Operation

not supported
wks_backup/homer_backup/logs/2014_05_19.log: trusted.ec.heal: 
Operation

not supported
wks_backup/homer_backup/logs/2014_05_22.log: trusted.ec.heal: 
Operation

not supported
wks_backup/homer_backup/logs: trusted.ec.heal: Operation not supported
wks_backup/homer_backup: trusted.ec.heal: Operation not supported
-- Original Message --
From: Benjamin Turner bennytu...@gmail.com 
mailto:bennytu...@gmail.com

To: David F. Robinson david.robin...@corvidtec.com
mailto:david.robin...@corvidtec.com
Cc: Gluster Devel gluster-devel@gluster.org
mailto:gluster-devel@gluster.org; gluster-us...@gluster.org
gluster-us...@gluster.org mailto:gluster-us...@gluster.org
Sent: 2/3/2015 7:12:34 PM
Subject: Re: [Gluster-devel] missing files
It sounds to me like the files were only copied to one replica, weren't
there for the initial ls which triggered a self heal,

there for the initial for the initial ls which triggered a self heal,
and were there for the last ls because they were healed. Is there any
chance that one of the replicas was down during the rsync? It could
be that you lost a brick during copy or something like that. To
confirm I would look for disconnects in the brick logs as well as
checking glusterfshd.log to verify the missing files were actually
healed.

-b

On Tue, Feb 3, 2015 at 5:37 PM, David F. Robinson
david.robin...@corvidtec.com mailto:david.robin...@corvidtec.com
wrote:

I rsync'd 20-TB over to my gluster system and noticed that I had
some directories missing even though the rsync completed 
normally.

The rsync logs showed that the missing files were transferred.
I went to the bricks and did an 'ls -al
/data/brick*/homegfs/dir/*' the files were on the bricks. After I
did this 'ls', the files then showed up on the FUSE mounts.
1) Why are the files hidden on the fuse mount?
2) Why does the ls make them show up on the FUSE mount?
3) How can I prevent this from happening again?
Note, I also mounted the gluster volume using NFS and saw

Re: [Gluster-devel] gluster 3.6.2 ls issues

2015-02-03 Thread David F. Robinson

Cancel this issue.  I found the problem.  Explanation below...

We use active directory to manage our user accounts; however, sssd doesn't 
seem to play well with gluster.  When I turn it on, the cpu load 
shoots up to between 80-100% and stays there (previously submitted bug 
report).  So, what I did on my gluster machines to keep the uid/gid info 
updated (required due to server.manage-gids=on) is write a script that 
starts sssd, grabs all of the groups/users from the server, parses the 
results into the /etc/group and /etc/passwd files, and then shuts down 
sssd.  I didn't realize that sssd uses the locally cached file.  My script 
was running faster than sssd was updating the cache file, so this particular 
user wasn't in the SBIR group on all of the machines.  He was in that 
group on gfs01a, but not on gfs01b (replica pair) or gfs02a/02b.  I 
guess this gave him enough permission to cd into the directory, but for 
some strange reason he couldn't do an ls and have the directory name 
show up.
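
For reference, a rough sketch of the kind of sync script described above 
(purely illustrative: it assumes sssd enumeration is enabled and that 
wholesale replacement of the local files is acceptable, which it often is not):

#!/bin/bash
#... Illustrative only: pull AD users/groups via sssd, cache them locally,
#... then stop sssd so it does not fight with gluster.
service sssd start
sleep 60    #... give sssd time to finish enumerating (the race described above)
getent passwd > /tmp/passwd.new
getent group  > /tmp/group.new
service sssd stop
#... Only swap the files in if the new copies are non-empty:
[ -s /tmp/passwd.new ] && cp /tmp/passwd.new /etc/passwd
[ -s /tmp/group.new  ] && cp /tmp/group.new  /etc/group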


The only reason I do any of this is because I had to use 
server.manage-gids to overcome the 32-group limitation.  This requires 
that my storage system have all of the user accounts and groups.  The 
preferred option would be to simply use sssd on my storage systems, but 
it doesn't seem to play well with gluster.


David


-- Original Message --
From: David F. Robinson david.robin...@corvidtec.com
To: Gluster Devel gluster-devel@gluster.org; 
gluster-us...@gluster.org gluster-us...@gluster.org

Sent: 2/3/2015 12:56:40 PM
Subject: gluster 3.6.2 ls issues

On my gluster filesystem mount, I have a user who does an ls and not all 
of the directories show up.  Note that the A15-029 directory 
doesn't show up.  However, as kbutz I can cd into the directory.


As root (also tested as several other users), I get the following from 
an ls -al

[root@sb1 2015.1]# ls -al
total 16
drwxrws--x 13 streadway sbir   868 Feb  3 12:48 .
drwxrws--- 46 root  sbir 16384 Feb  3 10:50 ..
drwxrws--x  5 cczechsbir   606 Jan 30 12:58 A15-007
drwxrws--x  5 kbutz sbir   291 Feb  3 12:11 A15-029
drwxrws--x  3 randerson sbir   219 Feb  3 11:52 A15-063
drwxrws--x  4 abirnbaum sbir   223 Feb  3 10:14 A15-088
drwxrws--x  2 anelson   sbir   270 Jan 27 14:30 AF151-058
drwxrws--x  3 tanderson sbir   216 Jan 28 09:43 AF151-072
drwxrws--x  3 streadway sbir   162 Jan 21 13:28 AF151-102
drwxrws--x  4 aaronward sbir   493 Feb  3 09:58 AF151-114
drwxrws--x  3 streadway sbir   162 Feb  3 12:07 AF151-174
drwxrws--x  3 dstowesbir   192 Jan 27 12:25 AF15-AT28
drwxrws--x  3 kboyett   sbir   199 Jan 28 09:43 NASA
As user kbutz, I get the following:
sb1:sbir/2015.1 ls -al
total 16
drwxrws--x 13 streadway sbir   868 Feb  3 12:48 ./
drwxrws--- 46 root  sbir 16384 Feb  3 10:50 ../
drwxrws--x  3 randerson sbir   219 Feb  3 11:52 A15-063/
drwxrws--x  4 abirnbaum sbir   223 Feb  3 10:14 A15-088/
drwxrws--x  2 anelson   sbir   270 Jan 27 14:30 AF151-058/
drwxrws--x  3 streadway sbir   162 Jan 21 13:28 AF151-102/
drwxrws--x  3 streadway sbir   162 Feb  3 12:07 AF151-174/
drwxrws--x  3 kboyett   sbir   199 Jan 28 09:43 NASA/
Note that I can still cd into the non-listed directory as kbutz:

[kbutz@sb1 ~]$ cd /homegfs/documentation/programs/sbir/2015.1
A15-063/  A15-088/  AF151-058/  AF151-102/  AF151-174/  NASA/

sb1:sbir/2015.1 cd A15-029
A15-029_proposal_draft_rev1.docx*  CB_work/  gun_work/  Refs/

David

===
David F. Robinson, Ph.D.
President - Corvid Technologies
704.799.6944 x101 [office]
704.252.1310 [cell]
704.799.7974 [fax]
david.robin...@corvidtec.com
http://www.corvidtechnologies.com

___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel


[Gluster-devel] missing files

2015-02-03 Thread David F. Robinson
I rsync'd 20-TB over to my gluster system and noticed that I had some 
directories missing even though the rsync completed normally.

The rsync logs showed that the missing files were transferred.

I went to the bricks and did an 'ls -al /data/brick*/homegfs/dir/*' the 
files were on the bricks.  After I did this 'ls', the files then showed 
up on the FUSE mounts.


1) Why are the files hidden on the fuse mount?
2) Why does the ls make them show up on the FUSE mount?
3) How can I prevent this from happening again?

Note, I also mounted the gluster volume using NFS and saw the same 
behavior.  The files/directories were not shown until I did the ls on 
the bricks.


David



===
David F. Robinson, Ph.D.
President - Corvid Technologies
704.799.6944 x101 [office]
704.252.1310 [cell]
704.799.7974 [fax]
david.robin...@corvidtec.com
http://www.corvidtechnologies.com

___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel


Re: [Gluster-devel] missing files

2015-02-03 Thread David F. Robinson
Sorry.  Thought about this a little more. I should have been clearer.  
The files were on both bricks of the replica, not just one side.  So, 
both bricks had to have been up... The files/directories just don't show 
up on the mount.


I was reading and saw a related bug 
(https://bugzilla.redhat.com/show_bug.cgi?id=1159484).  I saw it 
suggested to run:


find <mount> -d -exec getfattr -h -n trusted.ec.heal {} \;


I get a bunch of errors for operation not supported:

[root@gfs02a homegfs]# find wks_backup -d -exec getfattr -h -n 
trusted.ec.heal {} \;
find: warning: the -d option is deprecated; please use -depth instead, 
because the latter is a POSIX-compliant feature.

wks_backup/homer_backup/backup: trusted.ec.heal: Operation not supported
wks_backup/homer_backup/logs/2014_05_20.log: trusted.ec.heal: Operation 
not supported
wks_backup/homer_backup/logs/2014_05_21.log: trusted.ec.heal: Operation 
not supported
wks_backup/homer_backup/logs/2014_05_18.log: trusted.ec.heal: Operation 
not supported
wks_backup/homer_backup/logs/2014_05_19.log: trusted.ec.heal: Operation 
not supported
wks_backup/homer_backup/logs/2014_05_22.log: trusted.ec.heal: Operation 
not supported

wks_backup/homer_backup/logs: trusted.ec.heal: Operation not supported
wks_backup/homer_backup: trusted.ec.heal: Operation not supported

-- Original Message --
From: Benjamin Turner bennytu...@gmail.com
To: David F. Robinson david.robin...@corvidtec.com
Cc: Gluster Devel gluster-devel@gluster.org; 
gluster-us...@gluster.org gluster-us...@gluster.org

Sent: 2/3/2015 7:12:34 PM
Subject: Re: [Gluster-devel] missing files

It sounds to me like the files were only copied to one replica, weren't 
there for the initial ls which triggered a self heal, 
and were there for the last ls because they were healed.  Is there any 
chance that one of the replicas was down during the rsync?  It could be 
that you lost a brick during copy or something like that.  To confirm I 
would look for disconnects in the brick logs as well as checking 
glusterfshd.log to verify the missing files were actually healed.


-b

On Tue, Feb 3, 2015 at 5:37 PM, David F. Robinson 
david.robin...@corvidtec.com wrote:
I rsync'd 20-TB over to my gluster system and noticed that I had some 
directories missing even though the rsync completed normally.

The rsync logs showed that the missing files were transferred.

I went to the bricks and did an 'ls -al /data/brick*/homegfs/dir/*' 
the files were on the bricks.  After I did this 'ls', the files then 
showed up on the FUSE mounts.


1) Why are the files hidden on the fuse mount?
2) Why does the ls make them show up on the FUSE mount?
3) How can I prevent this from happening again?

Note, I also mounted the gluster volume using NFS and saw the same 
behavior.  The files/directories were not shown until I did the ls 
on the bricks.


David



===
David F. Robinson, Ph.D.
President - Corvid Technologies
704.799.6944 x101 [office]
704.252.1310 [cell]
704.799.7974 [fax]
david.robin...@corvidtec.com
http://www.corvidtechnologies.com



___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel

___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel


Re: [Gluster-devel] missing files

2015-02-03 Thread David F. Robinson

Like these?

data-brick02a-homegfs.log:[2015-02-03 19:09:34.568842] I 
[server.c:518:server_rpc_notify] 0-homegfs-server: disconnecting 
connection from 
gfs02a.corvidtec.com-18563-2015/02/03-19:07:58:519134-homegfs-client-2-0-0
data-brick02a-homegfs.log:[2015-02-03 19:09:41.286551] I 
[server.c:518:server_rpc_notify] 0-homegfs-server: disconnecting 
connection from 
gfs01a.corvidtec.com-12804-2015/02/03-19:09:38:497808-homegfs-client-2-0-0
data-brick02a-homegfs.log:[2015-02-03 19:16:35.906412] I 
[server.c:518:server_rpc_notify] 0-homegfs-server: disconnecting 
connection from 
gfs02b.corvidtec.com-27190-2015/02/03-19:15:53:458467-homegfs-client-2-0-0
data-brick02a-homegfs.log:[2015-02-03 19:51:22.761293] I 
[server.c:518:server_rpc_notify] 0-homegfs-server: disconnecting 
connection from 
gfs01a.corvidtec.com-25926-2015/02/03-19:51:02:89070-homegfs-client-2-0-0
data-brick02a-homegfs.log:[2015-02-03 20:54:02.772180] I 
[server.c:518:server_rpc_notify] 0-homegfs-server: disconnecting 
connection from 
gfs01b.corvidtec.com-4175-2015/02/02-16:44:31:179119-homegfs-client-2-0-1
data-brick02a-homegfs.log:[2015-02-03 22:44:47.458905] I 
[server.c:518:server_rpc_notify] 0-homegfs-server: disconnecting 
connection from 
gfs01a.corvidtec.com-29467-2015/02/03-22:44:05:838129-homegfs-client-2-0-0
data-brick02a-homegfs.log:[2015-02-03 22:47:42.830866] I 
[server.c:518:server_rpc_notify] 0-homegfs-server: disconnecting 
connection from 
gfs01a.corvidtec.com-30069-2015/02/03-22:47:37:209436-homegfs-client-2-0-0
data-brick02a-homegfs.log:[2015-02-03 22:48:26.785931] I 
[server.c:518:server_rpc_notify] 0-homegfs-server: disconnecting 
connection from 
gfs01a.corvidtec.com-30256-2015/02/03-22:47:55:203659-homegfs-client-2-0-0
data-brick02a-homegfs.log:[2015-02-03 22:53:25.530836] I 
[server.c:518:server_rpc_notify] 0-homegfs-server: disconnecting 
connection from 
gfs01a.corvidtec.com-30658-2015/02/03-22:53:21:627538-homegfs-client-2-0-0
data-brick02a-homegfs.log:[2015-02-03 22:56:14.033823] I 
[server.c:518:server_rpc_notify] 0-homegfs-server: disconnecting 
connection from 
gfs01a.corvidtec.com-30893-2015/02/03-22:56:01:450507-homegfs-client-2-0-0
data-brick02a-homegfs.log:[2015-02-03 22:56:55.622800] I 
[server.c:518:server_rpc_notify] 0-homegfs-server: disconnecting 
connection from 
gfs01a.corvidtec.com-31080-2015/02/03-22:56:32:665370-homegfs-client-2-0-0
data-brick02a-homegfs.log:[2015-02-03 22:59:11.445742] I 
[server.c:518:server_rpc_notify] 0-homegfs-server: disconnecting 
connection from 
gfs01a.corvidtec.com-31383-2015/02/03-22:58:45:190874-homegfs-client-2-0-0
data-brick02a-homegfs.log:[2015-02-03 23:06:26.482709] I 
[server.c:518:server_rpc_notify] 0-homegfs-server: disconnecting 
connection from 
gfs01a.corvidtec.com-31720-2015/02/03-23:06:11:340012-homegfs-client-2-0-0
data-brick02a-homegfs.log:[2015-02-03 23:10:54.807725] I 
[server.c:518:server_rpc_notify] 0-homegfs-server: disconnecting 
connection from 
gfs01a.corvidtec.com-32083-2015/02/03-23:10:22:131678-homegfs-client-2-0-0
data-brick02a-homegfs.log:[2015-02-03 23:13:35.545513] I 
[server.c:518:server_rpc_notify] 0-homegfs-server: disconnecting 
connection from 
gfs01a.corvidtec.com-32284-2015/02/03-23:13:21:26552-homegfs-client-2-0-0
data-brick02a-homegfs.log:[2015-02-03 23:14:19.065271] I 
[server.c:518:server_rpc_notify] 0-homegfs-server: disconnecting 
connection from 
gfs01a.corvidtec.com-32471-2015/02/03-23:13:48:221126-homegfs-client-2-0-0
data-brick02a-homegfs.log:[2015-02-04 00:18:20.261428] I 
[server.c:518:server_rpc_notify] 0-homegfs-server: disconnecting 
connection from 
gfs01a.corvidtec.com-1369-2015/02/04-00:16:53:613570-homegfs-client-2-0-0


-- Original Message --
From: Benjamin Turner bennytu...@gmail.com
To: David F. Robinson david.robin...@corvidtec.com
Cc: Gluster Devel gluster-devel@gluster.org; 
gluster-us...@gluster.org gluster-us...@gluster.org

Sent: 2/3/2015 7:12:34 PM
Subject: Re: [Gluster-devel] missing files

It sounds to me like the files were only copied to one replica, weren't 
there for the initial ls which triggered a self heal, 
and were there for the last ls because they were healed.  Is there any 
chance that one of the replicas was down during the rsync?  It could be 
that you lost a brick during copy or something like that.  To confirm I 
would look for disconnects in the brick logs as well as checking 
glusterfshd.log to verify the missing files were actually healed.


-b

On Tue, Feb 3, 2015 at 5:37 PM, David F. Robinson 
david.robin...@corvidtec.com wrote:
I rsync'd 20-TB over to my gluster system and noticed that I had some 
directories missing even though the rsync completed normally.

The rsync logs showed that the missing files were transferred.

I went to the bricks and did an 'ls -al /data/brick*/homegfs/dir/*' 
the files were on the bricks.  After I did this 'ls', the files then 
showed up on the FUSE mounts.


1) Why are the files hidden on the fuse mount?
2) Why does the ls

[Gluster-devel] failed heal

2015-02-01 Thread David F. Robinson
I have several files that gluster says it cannot heal.  I deleted the 
files from all of the bricks 
(/data/brick0*/hpc_shared/motorsports/gmics/Raven/p3/*) and ran a full 
heal using 'gluster volume heal homegfs full'.  Even after the full 
heal, the entries below still show up.

How do I clear these?


[root@gfs01a ~]# gluster volume heal homegfs info
Gathering list of entries to be healed on volume homegfs has been 
successful


Brick gfsib01a.corvidtec.com:/data/brick01a/homegfs
Number of entries: 10
/hpc_shared/motorsports/gmics/Raven/p3/70_rke/Movies
gfid:a6fc9011-74ad-4128-a232-4ccd41215ac8
gfid:bc17fa79-c1fd-483d-82b1-2c0d3564ddc5
gfid:ec804b5c-8bfc-4e7b-91e3-aded7952e609
gfid:ba62e340-4fad-477c-b450-704133577cbb
gfid:4843aa40-8361-4a97-88d5-d37fc28e04c0
gfid:c90a8f1c-c49e-4476-8a50-2bfb0a89323c
gfid:090042df-855a-4f5d-8929-c58feec10e33
/hpc_shared/motorsports/gmics/Raven/p3/70_rke/.Convrg.swp
/hpc_shared/motorsports/gmics/Raven/p3/70_rke

Brick gfsib01b.corvidtec.com:/data/brick01b/homegfs
Number of entries: 2
gfid:f96b4ddf-8a75-4abb-a640-15dbe41fdafa
/hpc_shared/motorsports/gmics/Raven/p3/70_rke

Brick gfsib01a.corvidtec.com:/data/brick02a/homegfs
Number of entries: 7
gfid:5d08fe1d-17b3-4a76-ab43-c708e346162f
/hpc_shared/motorsports/gmics/Raven/p3/70_rke/PICTURES/.tmpcheck
/hpc_shared/motorsports/gmics/Raven/p3/70_rke/PICTURES
/hpc_shared/motorsports/gmics/Raven/p3/70_rke/Movies
gfid:427d3738-3a41-4e51-ba2b-f0ba7254d013
gfid:8ad88a4d-8d5e-408f-a1de-36116cf6d5c1
gfid:0e034160-cd50-4108-956d-e45858f27feb

Brick gfsib01b.corvidtec.com:/data/brick02b/homegfs
Number of entries: 0

Brick gfsib02a.corvidtec.com:/data/brick01a/homegfs
Number of entries: 0

Brick gfsib02b.corvidtec.com:/data/brick01b/homegfs
Number of entries: 0

Brick gfsib02a.corvidtec.com:/data/brick02a/homegfs
Number of entries: 0

Brick gfsib02b.corvidtec.com:/data/brick02b/homegfs
Number of entries: 0
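
One hedged way to chase the leftover gfid entries above is to look for their 
backing links under each brick's .glusterfs directory (standard layout: 
.glusterfs/<first two hex chars>/<next two>/<full gfid>), for example:

gfid=a6fc9011-74ad-4128-a232-4ccd41215ac8
ls -l /data/brick01a/homegfs/.glusterfs/${gfid:0:2}/${gfid:2:2}/$gfid

If files were removed directly on the bricks, these gfid links can be left 
behind and keep showing up in heal info until they are cleaned up as well.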


===
David F. Robinson, Ph.D.
President - Corvid Technologies
704.799.6944 x101 [office]
704.252.1310 [cell]
704.799.7974 [fax]
david.robin...@corvidtec.com
http://www.corvidtechnologies.com

___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel


Re: [Gluster-devel] [Gluster-users] v3.6.2

2015-01-27 Thread David F. Robinson
After shutting down all NFS and gluster processes, there were still NFS-related 
services registered with the portmapper.


[root@gfs01bkp ~]# rpcinfo -p
   program vers proto   port  service
    100000    4   tcp    111  portmapper
    100000    3   tcp    111  portmapper
    100000    2   tcp    111  portmapper
    100000    4   udp    111  portmapper
    100000    3   udp    111  portmapper
    100000    2   udp    111  portmapper
    100005    3   tcp  38465  mountd
    100005    1   tcp  38466  mountd
    100003    3   tcp   2049  nfs
    100024    1   udp  34738  status
    100024    1   tcp  37269  status
[root@gfs01bkp ~]# netstat -anp | grep 2049
[root@gfs01bkp ~]# netstat -anp | grep 38465
[root@gfs01bkp ~]# netstat -anp | grep 38466

I killed off the processes using rpcinfo -d

[root@gfs01bkp ~]# rpcinfo -p
   program vers proto   port  service
    100000    4   tcp    111  portmapper
    100000    3   tcp    111  portmapper
    100000    2   tcp    111  portmapper
    100000    4   udp    111  portmapper
    100000    3   udp    111  portmapper
    100000    2   udp    111  portmapper
    100024    1   udp  34738  status
    100024    1   tcp  37269  status

Then I restarted glusterd and did a 'mount -a'.  It worked perfectly, 
and the errors that were showing up in the logs every 3 seconds stopped.


Thanks for your help.  Greatly appreciated.

David




-- Original Message --
From: Xavier Hernandez xhernan...@datalab.es
To: David F. Robinson david.robin...@corvidtec.com; Kaushal M 
kshlms...@gmail.com
Cc: Gluster Users gluster-us...@gluster.org; Gluster Devel 
gluster-devel@gluster.org

Sent: 1/27/2015 10:02:31 AM
Subject: Re: [Gluster-devel] [Gluster-users] v3.6.2


Hi,

I had a similar problem once. It happened after doing some unrelated 
tests with NFS. I thought it was a problem I generated doing weird 
things, so I didn't investigate the cause further.


To see if this is the same case, try this:

* Unmount all NFS mounts and stop all gluster volumes
* Check that there are no gluster processes running (ps ax | grep 
gluster), especially any glusterfs; glusterd is ok.

* Check that there are no NFS processes running (ps ax | grep nfs)
* Check with 'rpcinfo -p' that there's no nfs service registered

The output should be similar to this:

   program vers proto port service
100000 4 tcp 111 portmapper
100000 3 tcp 111 portmapper
100000 2 tcp 111 portmapper
100000 4 udp 111 portmapper
100000 3 udp 111 portmapper
100000 2 udp 111 portmapper
100024 1 udp 33482 status
100024 1 tcp 37034 status

If there are more services registered, you can directly delete them or 
check if they correspond to an active process. For example, if the 
output is this:


   program vers proto port service
100000 4 tcp 111 portmapper
100000 3 tcp 111 portmapper
100000 2 tcp 111 portmapper
100000 4 udp 111 portmapper
100000 3 udp 111 portmapper
100000 2 udp 111 portmapper
100021 3 udp 39618 nlockmgr
100021 3 tcp 41067 nlockmgr
100024 1 udp 33482 status
100024 1 tcp 37034 status

You can do a netstat -anp | grep 39618 to see if there is some 
process really listening at the nlockmgr port. You can repeat this for 
port 41067. If there is some process, you should stop it. If there is 
no process listening on that port, you should remove it with a command 
like this:


rpcinfo -d 100021 3

You must execute this command for all stale ports for any services 
other than portmapper and status. Once done you should get the output 
shown before.


After that, you can try to start your volume and see if everything is 
registered (rpcinfo -p) and if gluster has started the nfs server 
(gluster volume status).


If everything is ok, you should be able to mount the volume using NFS.

Xavi
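
A minimal sketch of the cleanup Xavi describes above, assuming bash and that 
only portmapper and status should remain registered (this just wraps the 
manual rpcinfo/netstat steps in a loop):

rpcinfo -p | awk 'NR>1 && $1 != 100000 && $1 != 100024 {print $1, $2, $4}' | sort -u |
while read prog vers port; do
    # Only deregister entries with no process actually listening on the port:
    if ! netstat -anp 2>/dev/null | grep -q ":$port "; then
        echo "removing stale registration: program $prog version $vers (port $port)"
        rpcinfo -d "$prog" "$vers"
    fi
done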

On 01/27/2015 03:18 PM, David F. Robinson wrote:

Turning off nfslock did not help. Also, still getting these messages
every 3-seconds:

[2015-01-27 14:16:12.921880] W [socket.c:611:__socket_rwv] 
0-management:

readv on /var/run/1f0cee5a2d074e39b32ee5a81c70e68c.socket failed
(Invalid argument)
[2015-01-27 14:16:15.922431] W [socket.c:611:__socket_rwv] 
0-management:

readv on /var/run/1f0cee5a2d074e39b32ee5a81c70e68c.socket failed
(Invalid argument)
[2015-01-27 14:16:18.923080] W [socket.c:611:__socket_rwv] 
0-management:

readv on /var/run/1f0cee5a2d074e39b32ee5a81c70e68c.socket failed
(Invalid argument)
[2015-01-27 14:16:21.923748] W [socket.c:611:__socket_rwv] 
0-management:

readv on /var/run/1f0cee5a2d074e39b32ee5a81c70e68c.socket failed
(Invalid argument)
[2015-01-27 14:16:24.924472] W [socket.c:611:__socket_rwv] 
0-management:

readv on /var/run/1f0cee5a2d074e39b32ee5a81c70e68c.socket failed
(Invalid argument)
[2015-01-27 14:16:27.925192] W [socket.c:611:__socket_rwv] 
0-management:

readv on /var/run/1f0cee5a2d074e39b32ee5a81c70e68c.socket failed
(Invalid argument)
[2015-01-27 14:16:30.925895] W [socket.c:611:__socket_rwv] 
0-management:

readv on /var/run

Re: [Gluster-devel] [Gluster-users] v3.6.2

2015-01-27 Thread David F. Robinson
I rebooted the machine to see if the problem would return, and it did.  
Same issue after a reboot.

Any suggestions?

One other thing I tested was to comment out the NFS mounts in 
/etc/fstab:
# gfsib01bkp.corvidtec.com:/homegfs_bkp /backup_nfs/homegfs nfs 
vers=3,intr,bg,rsize=32768,wsize=32768 0 0
After the machine comes back up, I remove the comment and do a 'mount 
-a'.  The mount works fine.


It looks like it is a timing issue during startup.  Is it trying to do 
the NFS mount while glusterd is still starting up?


David


-- Original Message --
From: Xavier Hernandez xhernan...@datalab.es
To: David F. Robinson david.robin...@corvidtec.com; Kaushal M 
kshlms...@gmail.com
Cc: Gluster Users gluster-us...@gluster.org; Gluster Devel 
gluster-devel@gluster.org

Sent: 1/27/2015 10:02:31 AM
Subject: Re: [Gluster-devel] [Gluster-users] v3.6.2


Hi,

I had a similar problem once. It happened after doing some unrelated 
tests with NFS. I thought it was a problem I generated doing weird 
things, so I didn't investigate the cause further.


To see if this is the same case, try this:

* Unmount all NFS mounts and stop all gluster volumes
* Check that there are no gluster processes running (ps ax | grep 
gluster), especially any glusterfs; glusterd is ok.

* Check that there are no NFS processes running (ps ax | grep nfs)
* Check with 'rpcinfo -p' that there's no nfs service registered

The output should be similar to this:

   program vers proto port service
100000 4 tcp 111 portmapper
100000 3 tcp 111 portmapper
100000 2 tcp 111 portmapper
100000 4 udp 111 portmapper
100000 3 udp 111 portmapper
100000 2 udp 111 portmapper
100024 1 udp 33482 status
100024 1 tcp 37034 status

If there are more services registered, you can directly delete them or 
check if they correspond to an active process. For example, if the 
output is this:


   program vers proto port service
100000 4 tcp 111 portmapper
100000 3 tcp 111 portmapper
100000 2 tcp 111 portmapper
100000 4 udp 111 portmapper
100000 3 udp 111 portmapper
100000 2 udp 111 portmapper
100021 3 udp 39618 nlockmgr
100021 3 tcp 41067 nlockmgr
100024 1 udp 33482 status
100024 1 tcp 37034 status

You can do a netstat -anp | grep 39618 to see if there is some 
process really listening at the nlockmgr port. You can repeat this for 
port 41067. If there is some process, you should stop it. If there is 
no process listening on that port, you should remove it with a command 
like this:


rpcinfo -d 100021 3

You must execute this command for all stale ports for any services 
other than portmapper and status. Once done you should get the output 
shown before.


After that, you can try to start your volume and see if everything is 
registered (rpcinfo -p) and if gluster has started the nfs server 
(gluster volume status).


If everything is ok, you should be able to mount the volume using NFS.

Xavi

On 01/27/2015 03:18 PM, David F. Robinson wrote:

Turning off nfslock did not help. Also, still getting these messages
every 3-seconds:

[2015-01-27 14:16:12.921880] W [socket.c:611:__socket_rwv] 
0-management:

readv on /var/run/1f0cee5a2d074e39b32ee5a81c70e68c.socket failed
(Invalid argument)
[2015-01-27 14:16:15.922431] W [socket.c:611:__socket_rwv] 
0-management:

readv on /var/run/1f0cee5a2d074e39b32ee5a81c70e68c.socket failed
(Invalid argument)
[2015-01-27 14:16:18.923080] W [socket.c:611:__socket_rwv] 
0-management:

readv on /var/run/1f0cee5a2d074e39b32ee5a81c70e68c.socket failed
(Invalid argument)
[2015-01-27 14:16:21.923748] W [socket.c:611:__socket_rwv] 
0-management:

readv on /var/run/1f0cee5a2d074e39b32ee5a81c70e68c.socket failed
(Invalid argument)
[2015-01-27 14:16:24.924472] W [socket.c:611:__socket_rwv] 
0-management:

readv on /var/run/1f0cee5a2d074e39b32ee5a81c70e68c.socket failed
(Invalid argument)
[2015-01-27 14:16:27.925192] W [socket.c:611:__socket_rwv] 
0-management:

readv on /var/run/1f0cee5a2d074e39b32ee5a81c70e68c.socket failed
(Invalid argument)
[2015-01-27 14:16:30.925895] W [socket.c:611:__socket_rwv] 
0-management:

readv on /var/run/1f0cee5a2d074e39b32ee5a81c70e68c.socket failed
(Invalid argument)
[2015-01-27 14:16:33.926563] W [socket.c:611:__socket_rwv] 
0-management:

readv on /var/run/1f0cee5a2d074e39b32ee5a81c70e68c.socket failed
(Invalid argument)
[2015-01-27 14:16:36.927248] W [socket.c:611:__socket_rwv] 
0-management:

readv on /var/run/1f0cee5a2d074e39b32ee5a81c70e68c.socket failed
(Invalid argument)
-- Original Message --
From: Kaushal M kshlms...@gmail.com mailto:kshlms...@gmail.com
To: David F. Robinson david.robin...@corvidtec.com
mailto:david.robin...@corvidtec.com
Cc: Joe Julian j...@julianfamily.org mailto:j...@julianfamily.org;
Gluster Users gluster-us...@gluster.org
mailto:gluster-us...@gluster.org; Gluster Devel
gluster-devel@gluster.org mailto:gluster-devel@gluster.org
Sent: 1/27/2015 1:49:56 AM
Subject: Re: Re[2]: [Gluster-devel] [Gluster

Re: [Gluster-devel] [Gluster-users] v3.6.2

2015-01-27 Thread David F. Robinson

In my /etc/fstab, I have the following:

  gfsib01bkp.corvidtec.com:/homegfs_bkp  /backup/homegfs
glusterfs   transport=tcp,_netdev 0 0
  gfsib01bkp.corvidtec.com:/Software_bkp /backup/Software   
glusterfs   transport=tcp,_netdev 0 0
  gfsib01bkp.corvidtec.com:/Source_bkp   /backup/Source 
glusterfs   transport=tcp,_netdev 0 0


  #... Setup NFS mounts as well
  gfsib01bkp.corvidtec.com:/homegfs_bkp /backup_nfs/homegfs nfs 
vers=3,intr,bg,rsize=32768,wsize=32768 0 0



It looks like it is trying to start the nfs mount before gluster has 
finished coming up and that this is hanging the nfs ports.  I have 
_netdev in the glusterfs mount point to make sure the network has come 
up (including infiniband) prior to starting gluster.  Shouldn't the 
gluster init scripts check for gluster startup prior to starting the nfs 
mount?  It doesn't look like this is working properly.


David



-- Original Message --
From: Xavier Hernandez xhernan...@datalab.es
To: David F. Robinson david.robin...@corvidtec.com; Kaushal M 
kshlms...@gmail.com
Cc: Gluster Users gluster-us...@gluster.org; Gluster Devel 
gluster-devel@gluster.org

Sent: 1/27/2015 10:02:31 AM
Subject: Re: [Gluster-devel] [Gluster-users] v3.6.2


Hi,

I had a similar problem once. It happened after doing some unrelated 
tests with NFS. I thought it was a problem I generated doing weird 
things, so I didn't investigate the cause further.


To see if this is the same case, try this:

* Unmount all NFS mounts and stop all gluster volumes
* Check that there are no gluster processes running (ps ax | grep 
gluster), especially any glusterfs; glusterd is ok.

* Check that there are no NFS processes running (ps ax | grep nfs)
* Check with 'rpcinfo -p' that there's no nfs service registered

The output should be similar to this:

   program vers proto port service
100000 4 tcp 111 portmapper
100000 3 tcp 111 portmapper
100000 2 tcp 111 portmapper
100000 4 udp 111 portmapper
100000 3 udp 111 portmapper
100000 2 udp 111 portmapper
100024 1 udp 33482 status
100024 1 tcp 37034 status

If there are more services registered, you can directly delete them or 
check if they correspond to an active process. For example, if the 
output is this:


   program vers proto port service
100000 4 tcp 111 portmapper
100000 3 tcp 111 portmapper
100000 2 tcp 111 portmapper
100000 4 udp 111 portmapper
100000 3 udp 111 portmapper
100000 2 udp 111 portmapper
100021 3 udp 39618 nlockmgr
100021 3 tcp 41067 nlockmgr
100024 1 udp 33482 status
100024 1 tcp 37034 status

You can do a netstat -anp | grep 39618 to see if there is some 
process really listening at the nlockmgr port. You can repeat this for 
port 41067. If there is some process, you should stop it. If there is 
no process listening on that port, you should remove it with a command 
like this:


rpcinfo -d 100021 3

You must execute this command for all stale ports for any services 
other than portmapper and status. Once done you should get the output 
shown before.


After that, you can try to start your volume and see if everything is 
registered (rpcinfo -p) and if gluster has started the nfs server 
(gluster volume status).


If everything is ok, you should be able to mount the volume using NFS.

Xavi

On 01/27/2015 03:18 PM, David F. Robinson wrote:

Turning off nfslock did not help. Also, still getting these messages
every 3-seconds:

[2015-01-27 14:16:12.921880] W [socket.c:611:__socket_rwv] 
0-management:

readv on /var/run/1f0cee5a2d074e39b32ee5a81c70e68c.socket failed
(Invalid argument)
[2015-01-27 14:16:15.922431] W [socket.c:611:__socket_rwv] 
0-management:

readv on /var/run/1f0cee5a2d074e39b32ee5a81c70e68c.socket failed
(Invalid argument)
[2015-01-27 14:16:18.923080] W [socket.c:611:__socket_rwv] 
0-management:

readv on /var/run/1f0cee5a2d074e39b32ee5a81c70e68c.socket failed
(Invalid argument)
[2015-01-27 14:16:21.923748] W [socket.c:611:__socket_rwv] 
0-management:

readv on /var/run/1f0cee5a2d074e39b32ee5a81c70e68c.socket failed
(Invalid argument)
[2015-01-27 14:16:24.924472] W [socket.c:611:__socket_rwv] 
0-management:

readv on /var/run/1f0cee5a2d074e39b32ee5a81c70e68c.socket failed
(Invalid argument)
[2015-01-27 14:16:27.925192] W [socket.c:611:__socket_rwv] 
0-management:

readv on /var/run/1f0cee5a2d074e39b32ee5a81c70e68c.socket failed
(Invalid argument)
[2015-01-27 14:16:30.925895] W [socket.c:611:__socket_rwv] 
0-management:

readv on /var/run/1f0cee5a2d074e39b32ee5a81c70e68c.socket failed
(Invalid argument)
[2015-01-27 14:16:33.926563] W [socket.c:611:__socket_rwv] 
0-management:

readv on /var/run/1f0cee5a2d074e39b32ee5a81c70e68c.socket failed
(Invalid argument)
[2015-01-27 14:16:36.927248] W [socket.c:611:__socket_rwv] 
0-management:

readv on /var/run/1f0cee5a2d074e39b32ee5a81c70e68c.socket failed
(Invalid argument)
-- Original Message --
From: Kaushal M kshlms...@gmail.com

Re: [Gluster-devel] [Gluster-users] v3.6.2

2015-01-27 Thread David F. Robinson
Not elegant, but here is my short-term fix to prevent the issue after a 
reboot:


Added 'noauto' to the NFS mounts in /etc/fstab and put the mount commands in /etc/rc.local:

/etc/fstab:
#... Note: Used the 'noauto' option for the NFS mounts and put the mount in /etc/rc.local
#... to ensure that gluster has been started before attempting to mount using NFS.
#... Otherwise, it hangs ports during startup.
gfsib01bkp.corvidtec.com:/homegfs_bkp /backup_nfs/homegfs nfs 
vers=3,intr,bg,rsize=32768,wsize=32768,noauto 0 0
gfsib01a.corvidtec.com:/homegfs /homegfs_nfs nfs 
vers=3,intr,bg,rsize=32768,wsize=32768,noauto 0 0



/etc/rc.local:
/etc/init.d/glusterd restart
(sleep 20; mount -a; mount /backup_nfs/homegfs)
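
A hedged alternative to the fixed 20-second sleep would be to wait until 
gluster's NFS service actually registers before mounting, e.g. in the same 
/etc/rc.local:

/etc/init.d/glusterd restart
# Wait up to ~60s for gluster's NFS server to register with the portmapper:
for i in $(seq 1 30); do
    rpcinfo -p | grep -qw nfs && break
    sleep 2
done
mount -a
mount /backup_nfs/homegfs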



-- Original Message --
From: Xavier Hernandez xhernan...@datalab.es
To: David F. Robinson david.robin...@corvidtec.com; Kaushal M 
kshlms...@gmail.com
Cc: Gluster Users gluster-us...@gluster.org; Gluster Devel 
gluster-devel@gluster.org

Sent: 1/27/2015 10:02:31 AM
Subject: Re: [Gluster-devel] [Gluster-users] v3.6.2


Hi,

I had a similar problem once. It happened after doing some unrelated 
tests with NFS. I thought it was a problem I generated doing weird 
things, so I didn't investigate the cause further.


To see if this is the same case, try this:

* Unmount all NFS mounts and stop all gluster volumes
* Check that there are no gluster processes running (ps ax | grep 
gluster), especially any glusterfs; glusterd is ok.

* Check that there are no NFS processes running (ps ax | grep nfs)
* Check with 'rpcinfo -p' that there's no nfs service registered

The output should be similar to this:

   program vers proto port service
100000 4 tcp 111 portmapper
100000 3 tcp 111 portmapper
100000 2 tcp 111 portmapper
100000 4 udp 111 portmapper
100000 3 udp 111 portmapper
100000 2 udp 111 portmapper
100024 1 udp 33482 status
100024 1 tcp 37034 status

If there are more services registered, you can directly delete them or 
check if they correspond to an active process. For example, if the 
output is this:


   program vers proto port service
100000 4 tcp 111 portmapper
100000 3 tcp 111 portmapper
100000 2 tcp 111 portmapper
100000 4 udp 111 portmapper
100000 3 udp 111 portmapper
100000 2 udp 111 portmapper
100021 3 udp 39618 nlockmgr
100021 3 tcp 41067 nlockmgr
100024 1 udp 33482 status
100024 1 tcp 37034 status

You can do a netstat -anp | grep 39618 to see if there is some 
process really listening at the nlockmgr port. You can repeat this for 
port 41067. If there is some process, you should stop it. If there is 
no process listening on that port, you should remove it with a command 
like this:


rpcinfo -d 100021 3

You must execute this command for all stale ports for any services 
other than portmapper and status. Once done you should get the output 
shown before.


After that, you can try to start your volume and see if everything is 
registered (rpcinfo -p) and if gluster has started the nfs server 
(gluster volume status).


If everything is ok, you should be able to mount the volume using NFS.

Xavi

On 01/27/2015 03:18 PM, David F. Robinson wrote:

Turning off nfslock did not help. Also, still getting these messages
every 3-seconds:

[2015-01-27 14:16:12.921880] W [socket.c:611:__socket_rwv] 
0-management:

readv on /var/run/1f0cee5a2d074e39b32ee5a81c70e68c.socket failed
(Invalid argument)
[2015-01-27 14:16:15.922431] W [socket.c:611:__socket_rwv] 
0-management:

readv on /var/run/1f0cee5a2d074e39b32ee5a81c70e68c.socket failed
(Invalid argument)
[2015-01-27 14:16:18.923080] W [socket.c:611:__socket_rwv] 
0-management:

readv on /var/run/1f0cee5a2d074e39b32ee5a81c70e68c.socket failed
(Invalid argument)
[2015-01-27 14:16:21.923748] W [socket.c:611:__socket_rwv] 
0-management:

readv on /var/run/1f0cee5a2d074e39b32ee5a81c70e68c.socket failed
(Invalid argument)
[2015-01-27 14:16:24.924472] W [socket.c:611:__socket_rwv] 
0-management:

readv on /var/run/1f0cee5a2d074e39b32ee5a81c70e68c.socket failed
(Invalid argument)
[2015-01-27 14:16:27.925192] W [socket.c:611:__socket_rwv] 
0-management:

readv on /var/run/1f0cee5a2d074e39b32ee5a81c70e68c.socket failed
(Invalid argument)
[2015-01-27 14:16:30.925895] W [socket.c:611:__socket_rwv] 
0-management:

readv on /var/run/1f0cee5a2d074e39b32ee5a81c70e68c.socket failed
(Invalid argument)
[2015-01-27 14:16:33.926563] W [socket.c:611:__socket_rwv] 
0-management:

readv on /var/run/1f0cee5a2d074e39b32ee5a81c70e68c.socket failed
(Invalid argument)
[2015-01-27 14:16:36.927248] W [socket.c:611:__socket_rwv] 
0-management:

readv on /var/run/1f0cee5a2d074e39b32ee5a81c70e68c.socket failed
(Invalid argument)
-- Original Message --
From: Kaushal M kshlms...@gmail.com mailto:kshlms...@gmail.com
To: David F. Robinson david.robin...@corvidtec.com
mailto:david.robin...@corvidtec.com
Cc: Joe Julian j...@julianfamily.org mailto:j...@julianfamily.org;
Gluster Users gluster-us...@gluster.org
mailto:gluster-us...@gluster.org; Gluster Devel

Re: [Gluster-devel] v3.6.2

2015-01-26 Thread David F. Robinson

Tried shutting down glusterd and glusterfsd and restarting.

[2015-01-26 14:52:53.548330] I [rpc-clnt.c:969:rpc_clnt_connection_init] 
0-management: setting frame-timeout to 600
[2015-01-26 14:52:53.549763] I [rpc-clnt.c:969:rpc_clnt_connection_init] 
0-management: setting frame-timeout to 600
[2015-01-26 14:52:53.551245] I [rpc-clnt.c:969:rpc_clnt_connection_init] 
0-management: setting frame-timeout to 600
[2015-01-26 14:52:53.552819] I [rpc-clnt.c:969:rpc_clnt_connection_init] 
0-management: setting frame-timeout to 600
[2015-01-26 14:52:53.554289] I [rpc-clnt.c:969:rpc_clnt_connection_init] 
0-management: setting frame-timeout to 600
[2015-01-26 14:52:53.555769] I [rpc-clnt.c:969:rpc_clnt_connection_init] 
0-management: setting frame-timeout to 600
[2015-01-26 14:52:53.564429] I [rpc-clnt.c:969:rpc_clnt_connection_init] 
0-management: setting frame-timeout to 600
[2015-01-26 14:52:53.565578] W [socket.c:611:__socket_rwv] 0-management: 
readv on /var/run/0cdef7faa934cfe52676689ff8c0110f.socket failed 
(Invalid argument)
[2015-01-26 14:52:53.566488] I [MSGID: 106005] 
[glusterd-handler.c:4142:__glusterd_brick_rpc_notify] 0-management: 
Brick gfsib01bkp.corvidtec.com:/data/brick01bkp/Software_bkp has 
disconnected from glusterd.
[2015-01-26 14:52:53.567453] W [socket.c:611:__socket_rwv] 0-management: 
readv on /var/run/09e734d5e8d52bb796896c7a33d0a3ff.socket failed 
(Invalid argument)
[2015-01-26 14:52:53.568248] I [MSGID: 106005] 
[glusterd-handler.c:4142:__glusterd_brick_rpc_notify] 0-management: 
Brick gfsib01bkp.corvidtec.com:/data/brick02bkp/Software_bkp has 
disconnected from glusterd.
[2015-01-26 14:52:53.569009] W [socket.c:611:__socket_rwv] 0-management: 
readv on /var/run/3f6844c74682f39fa7457082119628c5.socket failed 
(Invalid argument)
[2015-01-26 14:52:53.569851] I [MSGID: 106005] 
[glusterd-handler.c:4142:__glusterd_brick_rpc_notify] 0-management: 
Brick gfsib01bkp.corvidtec.com:/data/brick01bkp/Source_bkp has 
disconnected from glusterd.
[2015-01-26 14:52:53.570818] W [socket.c:611:__socket_rwv] 0-management: 
readv on /var/run/34d5cc70aba63082bbb467ab450bd08b.socket failed 
(Invalid argument)
[2015-01-26 14:52:53.571777] I [MSGID: 106005] 
[glusterd-handler.c:4142:__glusterd_brick_rpc_notify] 0-management: 
Brick gfsib01bkp.corvidtec.com:/data/brick02bkp/Source_bkp has 
disconnected from glusterd.
[2015-01-26 14:52:53.572681] W [socket.c:611:__socket_rwv] 0-management: 
readv on /var/run/0cd747876dca36cb21ecc7a36f7f897c.socket failed 
(Invalid argument)
[2015-01-26 14:52:53.573533] I [MSGID: 106005] 
[glusterd-handler.c:4142:__glusterd_brick_rpc_notify] 0-management: 
Brick gfsib01bkp.corvidtec.com:/data/brick01bkp/homegfs_bkp has 
disconnected from glusterd.
[2015-01-26 14:52:53.574433] W [socket.c:611:__socket_rwv] 0-management: 
readv on /var/run/88744e1365b414d41e720e480700716a.socket failed 
(Invalid argument)
[2015-01-26 14:52:53.575399] I [MSGID: 106005] 
[glusterd-handler.c:4142:__glusterd_brick_rpc_notify] 0-management: 
Brick gfsib01bkp.corvidtec.com:/data/brick02bkp/homegfs_bkp has 
disconnected from glusterd.
[2015-01-26 14:52:53.575434] W [socket.c:611:__socket_rwv] 0-management: 
readv on /var/run/1f0cee5a2d074e39b32ee5a81c70e68c.socket failed 
(Invalid argument)
[2015-01-26 14:52:53.575447] I [MSGID: 106006] 
[glusterd-handler.c:4257:__glusterd_nodesvc_rpc_notify] 0-management: 
nfs has disconnected from glusterd.
[2015-01-26 14:52:53.579663] I [glusterd-pmap.c:227:pmap_registry_bind] 
0-pmap: adding brick /data/brick01bkp/homegfs_bkp on port 49152
[2015-01-26 14:52:53.581943] I [glusterd-pmap.c:227:pmap_registry_bind] 
0-pmap: adding brick /data/brick02bkp/Source_bkp on port 49156
[2015-01-26 14:52:53.583487] I [glusterd-pmap.c:227:pmap_registry_bind] 
0-pmap: adding brick /data/brick01bkp/Source_bkp on port 49153
[2015-01-26 14:52:53.584921] I [glusterd-pmap.c:227:pmap_registry_bind] 
0-pmap: adding brick /data/brick02bkp/Software_bkp on port 49157
[2015-01-26 14:52:53.585719] I [glusterd-pmap.c:227:pmap_registry_bind] 
0-pmap: adding brick /data/brick01bkp/Software_bkp on port 49154
[2015-01-26 14:52:53.586281] I [glusterd-pmap.c:227:pmap_registry_bind] 
0-pmap: adding brick /data/brick02bkp/homegfs_bkp on port 49155




-- Original Message --
From: David F. Robinson david.robin...@corvidtec.com
To: gluster-us...@gluster.org gluster-us...@gluster.org; Gluster 
Devel gluster-devel@gluster.org

Sent: 1/26/2015 9:50:09 AM
Subject: v3.6.2

I have a server with v3.6.2 from which I cannot mount using NFS.  The 
FUSE mount works, however, I cannot get the NFS mount to work. From 
/var/log/message:


Jan 26 09:27:28 gfs01bkp mount[2810]: mount to NFS server 
'gfsib01bkp.corvidtec.com' failed: Connection refused, retrying
Jan 26 09:27:53 gfs01bkp mount[4456]: mount to NFS server 
'gfsib01bkp.corvidtec.com' failed: Connection refused, retrying
Jan 26 09:29:28 gfs01bkp mount[2810]: mount to NFS server 
'gfsib01bkp.corvidtec.com' failed: Connection refused, retrying
Jan 26

Re: [Gluster-devel] v3.6.2

2015-01-26 Thread David F. Robinson

No firewall used on that machine.

[root@gfs01bkp ~]# /etc/init.d/iptables status
iptables: Firewall is not running.
[root@gfs01bkp ~]# cat /etc/selinux/config

# This file controls the state of SELinux on the system.
# SELINUX= can take one of these three values:
# enforcing - SELinux security policy is enforced.
# permissive - SELinux prints warnings instead of enforcing.
# disabled - No SELinux policy is loaded.
SELINUX=disabled
# SELINUXTYPE= can take one of these two values:
# targeted - Targeted processes are protected,
# mls - Multi Level Security protection.
SELINUXTYPE=targeted





-- Original Message --
From: Justin Clift jus...@gluster.org
To: David F. Robinson david.robin...@corvidtec.com
Cc: Gluster Users gluster-us...@gluster.org; Gluster Devel 
gluster-devel@gluster.org

Sent: 1/26/2015 11:11:15 AM
Subject: Re: [Gluster-devel] v3.6.2

On 26 Jan 2015, at 14:50, David F. Robinson 
david.robin...@corvidtec.com wrote:
 I have a server with v3.6.2 from which I cannot mount using NFS. The 
FUSE mount works, however, I cannot get the NFS mount to work. From 
/var/log/message:


 Jan 26 09:27:28 gfs01bkp mount[2810]: mount to NFS server 
'gfsib01bkp.corvidtec.com' failed: Connection refused, retrying
 Jan 26 09:27:53 gfs01bkp mount[4456]: mount to NFS server 
'gfsib01bkp.corvidtec.com' failed: Connection refused, retrying
 Jan 26 09:29:28 gfs01bkp mount[2810]: mount to NFS server 
'gfsib01bkp.corvidtec.com' failed: Connection refused, retrying
 Jan 26 09:29:53 gfs01bkp mount[4456]: mount to NFS server 
'gfsib01bkp.corvidtec.com' failed: Connection refused, retrying
 Jan 26 09:31:28 gfs01bkp mount[2810]: mount to NFS server 
'gfsib01bkp.corvidtec.com' failed: Connection refused, retrying
 Jan 26 09:31:53 gfs01bkp mount[4456]: mount to NFS server 
'gfsib01bkp.corvidtec.com' failed: Connection refused, retrying
 Jan 26 09:33:28 gfs01bkp mount[2810]: mount to NFS server 
'gfsib01bkp.corvidtec.com' failed: Connection refused, retrying
 Jan 26 09:33:53 gfs01bkp mount[4456]: mount to NFS server 
'gfsib01bkp.corvidtec.com' failed: Connection refused, retrying
 Jan 26 09:35:28 gfs01bkp mount[2810]: mount to NFS server 
'gfsib01bkp.corvidtec.com' failed: Connection refused, retrying
 Jan 26 09:35:53 gfs01bkp mount[4456]: mount to NFS server 
'gfsib01bkp.corvidtec.com' failed: Connection refused, retrying



 I also am continually getting the following errors in 
/var/log/glusterfs:


 [root@gfs01bkp glusterfs]# tail -f etc-glusterfs-glusterd.vol.log
 [2015-01-26 14:41:51.260827] W [socket.c:611:__socket_rwv] 
0-management: readv on 
/var/run/1f0cee5a2d074e39b32ee5a81c70e68c.socket failed (Invalid 
argument)
 [2015-01-26 14:41:54.261240] W [socket.c:611:__socket_rwv] 
0-management: readv on 
/var/run/1f0cee5a2d074e39b32ee5a81c70e68c.socket failed (Invalid 
argument)
 [2015-01-26 14:41:57.261642] W [socket.c:611:__socket_rwv] 
0-management: readv on 
/var/run/1f0cee5a2d074e39b32ee5a81c70e68c.socket failed (Invalid 
argument)
 [2015-01-26 14:42:00.262073] W [socket.c:611:__socket_rwv] 
0-management: readv on 
/var/run/1f0cee5a2d074e39b32ee5a81c70e68c.socket failed (Invalid 
argument)
 [2015-01-26 14:42:03.262504] W [socket.c:611:__socket_rwv] 
0-management: readv on 
/var/run/1f0cee5a2d074e39b32ee5a81c70e68c.socket failed (Invalid 
argument)
 [2015-01-26 14:42:06.262935] W [socket.c:611:__socket_rwv] 
0-management: readv on 
/var/run/1f0cee5a2d074e39b32ee5a81c70e68c.socket failed (Invalid 
argument)
 [2015-01-26 14:42:09.263334] W [socket.c:611:__socket_rwv] 
0-management: readv on 
/var/run/1f0cee5a2d074e39b32ee5a81c70e68c.socket failed (Invalid 
argument)
 [2015-01-26 14:42:12.263761] W [socket.c:611:__socket_rwv] 
0-management: readv on 
/var/run/1f0cee5a2d074e39b32ee5a81c70e68c.socket failed (Invalid 
argument)
 [2015-01-26 14:42:15.264177] W [socket.c:611:__socket_rwv] 
0-management: readv on 
/var/run/1f0cee5a2d074e39b32ee5a81c70e68c.socket failed (Invalid 
argument)
 [2015-01-26 14:42:18.264623] W [socket.c:611:__socket_rwv] 
0-management: readv on 
/var/run/1f0cee5a2d074e39b32ee5a81c70e68c.socket failed (Invalid 
argument)
 [2015-01-26 14:42:21.265053] W [socket.c:611:__socket_rwv] 
0-management: readv on 
/var/run/1f0cee5a2d074e39b32ee5a81c70e68c.socket failed (Invalid 
argument)
 [2015-01-26 14:42:24.265504] W [socket.c:611:__socket_rwv] 
0-management: readv on 
/var/run/1f0cee5a2d074e39b32ee5a81c70e68c.socket failed (Invalid 
argument)

 ^C

 Also, when I try to NFS mount my gluster volume, I am getting


Any chance there's a network or host based firewall stopping some of 
the ports?


+ Justin

--
GlusterFS - http://www.gluster.org

An open source, distributed file system scaling to several
petabytes, and handling thousands of clients.

My personal twitter: twitter.com/realjustinclift



___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel


Re: [Gluster-devel] [Gluster-users] v3.6.2

2015-01-26 Thread David F. Robinson

Tried that... Still having errors starting gluster NFS...

From the /var/log/glusterfs/nfs.log file:


[2015-01-26 19:51:25.996481] I [MSGID: 100030] [glusterfsd.c:2018:main] 
0-/usr/sbin/glusterfs: Started running /usr/sbin/glusterfs version 3.6.2 
(args: /usr/sbin/glusterfs -s localhost --volfile-id gluster/nfs -p 
/var/lib/glusterd/nfs/run/nfs.pid -l /var/log/glusterfs/nfs.log -S 
/var/run/1f0cee5a2d074e39b32ee5a81c70e68c.socket)
[2015-01-26 19:51:26.005501] I 
[rpcsvc.c:2142:rpcsvc_set_outstanding_rpc_limit] 0-rpc-service: 
Configured rpc.outstanding-rpc-limit with value 16
[2015-01-26 19:51:26.054144] E [nlm4.c:2481:nlm4svc_init] 0-nfs-NLM: 
unable to start /sbin/rpc.statd
[2015-01-26 19:51:26.054183] E [nfs.c:1342:init] 0-nfs: Failed to 
initialize protocols
[2015-01-26 19:51:26.054191] E [xlator.c:425:xlator_init] 0-nfs-server: 
Initialization of volume 'nfs-server' failed, review your volfile again
[2015-01-26 19:51:26.054198] E [graph.c:322:glusterfs_graph_init] 
0-nfs-server: initializing translator failed
[2015-01-26 19:51:26.054205] E [graph.c:525:glusterfs_graph_activate] 
0-graph: init failed
[2015-01-26 19:51:26.05] W [glusterfsd.c:1194:cleanup_and_exit] (-- 
0-: received signum (0), shutting down
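
The nlm4svc_init error above means the gluster NFS process gave up because it could not exec /sbin/rpc.statd. A minimal check sequence, assuming the stock EL6 packaging where rpc.statd ships in nfs-utils and rpcbind must already be running (package and service names are my assumption, not something stated in the thread):

# confirm the pieces the gluster NFS server depends on are installed
rpm -q nfs-utils rpcbind
ls -l /sbin/rpc.statd
# rpcbind has to be up before statd/NLM can register
service rpcbind status || service rpcbind start
# try starting the lock manager by hand; its error output usually explains the nlm4svc_init failure
/sbin/rpc.statd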




-- Original Message --
From: Anatoly Pugachev mator...@gmail.com
To: David F. Robinson david.robin...@corvidtec.com
Cc: gluster-us...@gluster.org gluster-us...@gluster.org; Gluster 
Devel gluster-devel@gluster.org

Sent: 1/26/2015 2:48:08 PM
Subject: Re: [Gluster-users] v3.6.2


David,

can you stop glusterfs on affected machine and remove gluster related 
socket extension files from /var/run ? Start glusterfs service again 
and try once more ?
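
A sketch of that sequence, assuming the EL6 init scripts and that the hex-named .socket files are the gluster-owned ones quoted in the logs (verify before deleting anything):

service glusterd stop
# remove only the stale gluster service sockets, e.g. the one failing repeatedly above
rm -f /var/run/1f0cee5a2d074e39b32ee5a81c70e68c.socket
service glusterd start
gluster volume status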


On Mon, Jan 26, 2015 at 5:57 PM, David F. Robinson 
david.robin...@corvidtec.com wrote:

Tried shutting down glusterd and glusterfsd and restarting.

[2015-01-26 14:52:53.548330] I 
[rpc-clnt.c:969:rpc_clnt_connection_init] 0-management: setting 
frame-timeout to 600
[2015-01-26 14:52:53.549763] I 
[rpc-clnt.c:969:rpc_clnt_connection_init] 0-management: setting 
frame-timeout to 600
[2015-01-26 14:52:53.551245] I 
[rpc-clnt.c:969:rpc_clnt_connection_init] 0-management: setting 
frame-timeout to 600
[2015-01-26 14:52:53.552819] I 
[rpc-clnt.c:969:rpc_clnt_connection_init] 0-management: setting 
frame-timeout to 600
[2015-01-26 14:52:53.554289] I 
[rpc-clnt.c:969:rpc_clnt_connection_init] 0-management: setting 
frame-timeout to 600
[2015-01-26 14:52:53.555769] I 
[rpc-clnt.c:969:rpc_clnt_connection_init] 0-management: setting 
frame-timeout to 600
[2015-01-26 14:52:53.564429] I 
[rpc-clnt.c:969:rpc_clnt_connection_init] 0-management: setting 
frame-timeout to 600
[2015-01-26 14:52:53.565578] W [socket.c:611:__socket_rwv] 
0-management: readv on 
/var/run/0cdef7faa934cfe52676689ff8c0110f.socket failed (Invalid 
argument)
[2015-01-26 14:52:53.566488] I [MSGID: 106005] 
[glusterd-handler.c:4142:__glusterd_brick_rpc_notify] 0-management: 
Brick gfsib01bkp.corvidtec.com:/data/brick01bkp/Software_bkp has 
disconnected from glusterd.
[2015-01-26 14:52:53.567453] W [socket.c:611:__socket_rwv] 
0-management: readv on 
/var/run/09e734d5e8d52bb796896c7a33d0a3ff.socket failed (Invalid 
argument)
[2015-01-26 14:52:53.568248] I [MSGID: 106005] 
[glusterd-handler.c:4142:__glusterd_brick_rpc_notify] 0-management: 
Brick gfsib01bkp.corvidtec.com:/data/brick02bkp/Software_bkp has 
disconnected from glusterd.
[2015-01-26 14:52:53.569009] W [socket.c:611:__socket_rwv] 
0-management: readv on 
/var/run/3f6844c74682f39fa7457082119628c5.socket failed (Invalid 
argument)
[2015-01-26 14:52:53.569851] I [MSGID: 106005] 
[glusterd-handler.c:4142:__glusterd_brick_rpc_notify] 0-management: 
Brick gfsib01bkp.corvidtec.com:/data/brick01bkp/Source_bkp has 
disconnected from glusterd.
[2015-01-26 14:52:53.570818] W [socket.c:611:__socket_rwv] 
0-management: readv on 
/var/run/34d5cc70aba63082bbb467ab450bd08b.socket failed (Invalid 
argument)
[2015-01-26 14:52:53.571777] I [MSGID: 106005] 
[glusterd-handler.c:4142:__glusterd_brick_rpc_notify] 0-management: 
Brick gfsib01bkp.corvidtec.com:/data/brick02bkp/Source_bkp has 
disconnected from glusterd.
[2015-01-26 14:52:53.572681] W [socket.c:611:__socket_rwv] 
0-management: readv on 
/var/run/0cd747876dca36cb21ecc7a36f7f897c.socket failed (Invalid 
argument)
[2015-01-26 14:52:53.573533] I [MSGID: 106005] 
[glusterd-handler.c:4142:__glusterd_brick_rpc_notify] 0-management: 
Brick gfsib01bkp.corvidtec.com:/data/brick01bkp/homegfs_bkp has 
disconnected from glusterd.
[2015-01-26 14:52:53.574433] W [socket.c:611:__socket_rwv] 
0-management: readv on 
/var/run/88744e1365b414d41e720e480700716a.socket failed (Invalid 
argument)
[2015-01-26 14:52:53.575399] I [MSGID: 106005] 
[glusterd-handler.c:4142:__glusterd_brick_rpc_notify] 0-management: 
Brick gfsib01bkp.corvidtec.com:/data/brick02bkp/homegfs_bkp has 
disconnected from glusterd.
[2015-01-26 14:52:53.575434] W [socket.c:611:__socket_rwv] 
0-management: readv on 
/var/run

Re: [Gluster-devel] [Gluster-users] v3.6.2

2015-01-26 Thread David F. Robinson

[root@gfs01bkp bricks]# ps -ef | grep rpcbind
rpc   2306 1  0 11:32 ?00:00:00 rpcbind
root  5265  4638  0 11:55 pts/000:00:00 grep rpcbind
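
Beyond the process check, it may be worth confirming what rpcbind actually has registered; a quick illustration using standard rpcinfo (not something asked for in the thread):

# the gluster NFS server needs nfs, mountd, nlockmgr and status registrations
rpcinfo -p localhost | egrep 'nfs|mountd|nlockmgr|status'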

-- Original Message --
From: Joe Julian j...@julianfamily.org
To: David F. Robinson david.robin...@corvidtec.com; 
gluster-us...@gluster.org gluster-us...@gluster.org; Gluster Devel 
gluster-devel@gluster.org

Sent: 1/26/2015 11:55:09 AM
Subject: Re: [Gluster-users] v3.6.2


Is rpcbind running?

On January 26, 2015 6:57:44 AM PST, David F. Robinson 
david.robin...@corvidtec.com wrote:

Tried shutting down glusterd and glusterfsd and restarting.

[2015-01-26 14:52:53.548330] I 
[rpc-clnt.c:969:rpc_clnt_connection_init] 0-management: setting 
frame-timeout to 600
[2015-01-26 14:52:53.549763] I 
[rpc-clnt.c:969:rpc_clnt_connection_init] 0-management: setting 
frame-timeout to 600
[2015-01-26 14:52:53.551245] I 
[rpc-clnt.c:969:rpc_clnt_connection_init] 0-management: setting 
frame-timeout to 600
[2015-01-26 14:52:53.552819] I 
[rpc-clnt.c:969:rpc_clnt_connection_init] 0-management: setting 
frame-timeout to 600
[2015-01-26 14:52:53.554289] I 
[rpc-clnt.c:969:rpc_clnt_connection_init] 0-management: setting 
frame-timeout to 600
[2015-01-26 14:52:53.555769] I 
[rpc-clnt.c:969:rpc_clnt_connection_init] 0-management: setting 
frame-timeout to 600
[2015-01-26 14:52:53.564429] I 
[rpc-clnt.c:969:rpc_clnt_connection_init] 0-management: setting 
frame-timeout to 600
[2015-01-26 14:52:53.565578] W [socket.c:611:__socket_rwv] 
0-management: readv on 
/var/run/0cdef7faa934cfe52676689ff8c0110f.socket failed (Invalid 
argument)
[2015-01-26 14:52:53.566488] I [MSGID: 106005] 
[glusterd-handler.c:4142:__glusterd_brick_rpc_notify] 0-management: 
Brick gfsib01bkp.corvidtec.com:/data/brick01bkp/Software_bkp has 
disconnected from glusterd.
[2015-01-26 14:52:53.567453] W [socket.c:611:__socket_rwv] 
0-management: readv on 
/var/run/09e734d5e8d52bb796896c7a33d0a3ff.socket failed (Invalid 
argument)
[2015-01-26 14:52:53.568248] I [MSGID: 106005] 
[glusterd-handler.c:4142:__glusterd_brick_rpc_notify] 0-management: 
Brick gfsib01bkp.corvidtec.com:/data/brick02bkp/Software_bkp has 
disconnected from glusterd.
[2015-01-26 14:52:53.569009] W [socket.c:611:__socket_rwv] 
0-management: readv on 
/var/run/3f6844c74682f39fa7457082119628c5.socket failed (Invalid 
argument)
[2015-01-26 14:52:53.569851] I [MSGID: 106005] 
[glusterd-handler.c:4142:__glusterd_brick_rpc_notify] 0-management: 
Brick gfsib01bkp.corvidtec.com:/data/brick01bkp/Source_bkp has 
disconnected from glusterd.
[2015-01-26 14:52:53.570818] W [socket.c:611:__socket_rwv] 
0-management: readv on 
/var/run/34d5cc70aba63082bbb467ab450bd08b.socket failed (Invalid 
argument)
[2015-01-26 14:52:53.571777] I [MSGID: 106005] 
[glusterd-handler.c:4142:__glusterd_brick_rpc_notify] 0-management: 
Brick gfsib01bkp.corvidtec.com:/data/brick02bkp/Source_bkp has 
disconnected from glusterd.
[2015-01-26 14:52:53.572681] W [socket.c:611:__socket_rwv] 
0-management: readv on 
/var/run/0cd747876dca36cb21ecc7a36f7f897c.socket failed (Invalid 
argument)
[2015-01-26 14:52:53.573533] I [MSGID: 106005] 
[glusterd-handler.c:4142:__glusterd_brick_rpc_notify] 0-management: 
Brick gfsib01bkp.corvidtec.com:/data/brick01bkp/homegfs_bkp has 
disconnected from glusterd.
[2015-01-26 14:52:53.574433] W [socket.c:611:__socket_rwv] 
0-management: readv on 
/var/run/88744e1365b414d41e720e480700716a.socket failed (Invalid 
argument)
[2015-01-26 14:52:53.575399] I [MSGID: 106005] 
[glusterd-handler.c:4142:__glusterd_brick_rpc_notify] 0-management: 
Brick gfsib01bkp.corvidtec.com:/data/brick02bkp/homegfs_bkp has 
disconnected from glusterd.
[2015-01-26 14:52:53.575434] W [socket.c:611:__socket_rwv] 
0-management: readv on 
/var/run/1f0cee5a2d074e39b32ee5a81c70e68c.socket failed (Invalid 
argument)
[2015-01-26 14:52:53.575447] I [MSGID: 106006] 
[glusterd-handler.c:4257:__glusterd_nodesvc_rpc_notify] 0-management: 
nfs has disconnected from glusterd.
[2015-01-26 14:52:53.579663] I 
[glusterd-pmap.c:227:pmap_registry_bind] 0-pmap: adding brick 
/data/brick01bkp/homegfs_bkp on port 49152
[2015-01-26 14:52:53.581943] I 
[glusterd-pmap.c:227:pmap_registry_bind] 0-pmap: adding brick 
/data/brick02bkp/Source_bkp on port 49156
[2015-01-26 14:52:53.583487] I 
[glusterd-pmap.c:227:pmap_registry_bind] 0-pmap: adding brick 
/data/brick01bkp/Source_bkp on port 49153
[2015-01-26 14:52:53.584921] I 
[glusterd-pmap.c:227:pmap_registry_bind] 0-pmap: adding brick 
/data/brick02bkp/Software_bkp on port 49157
[2015-01-26 14:52:53.585719] I 
[glusterd-pmap.c:227:pmap_registry_bind] 0-pmap: adding brick 
/data/brick01bkp/Software_bkp on port 49154
[2015-01-26 14:52:53.586281] I 
[glusterd-pmap.c:227:pmap_registry_bind] 0-pmap: adding brick 
/data/brick02bkp/homegfs_bkp on port 49155




-- Original Message --
From: David F. Robinson david.robin...@corvidtec.com
To: gluster-us...@gluster.org gluster-us...@gluster.org; Gluster 
Devel gluster-devel@gluster.org

Re: [Gluster-devel] v3.6.2

2015-01-26 Thread David F. Robinson

[root@gfs01bkp ~]# gluster volume status homegfs_bkp
Status of volume: homegfs_bkp
Gluster process                                              Port    Online  Pid
------------------------------------------------------------------------------
Brick gfsib01bkp.corvidtec.com:/data/brick01bkp/homegfs_bkp  49152   Y       4087
Brick gfsib01bkp.corvidtec.com:/data/brick02bkp/homegfs_bkp  49155   Y       4092
NFS Server on localhost                                      N/A     N       N/A

Task Status of Volume homegfs_bkp
------------------------------------------------------------------------------
Task     : Rebalance
ID       : 6d4c6c4e-16da-48c9-9019-dccb7d2cfd66
Status   : completed
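
Since the NFS Server row shows Online N with no PID, the NFS-specific status and log are the obvious next step; a sketch, assuming the per-service status syntax available in this CLI:

gluster volume status homegfs_bkp nfs
tail -n 50 /var/log/glusterfs/nfs.log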




-- Original Message --
From: Atin Mukherjee amukh...@redhat.com
To: Pranith Kumar Karampuri pkara...@redhat.com; Justin Clift 
jus...@gluster.org; David F. Robinson david.robin...@corvidtec.com
Cc: Gluster Users gluster-us...@gluster.org; Gluster Devel 
gluster-devel@gluster.org

Sent: 1/26/2015 11:51:13 PM
Subject: Re: [Gluster-devel] v3.6.2




On 01/27/2015 07:33 AM, Pranith Kumar Karampuri wrote:


 On 01/26/2015 09:41 PM, Justin Clift wrote:

 On 26 Jan 2015, at 14:50, David F. Robinson
 david.robin...@corvidtec.com wrote:
 I have a server with v3.6.2 from which I cannot mount using NFS. 
The

 FUSE mount works, however, I cannot get the NFS mount to work. From
 /var/log/message:
   Jan 26 09:27:28 gfs01bkp mount[2810]: mount to NFS server
 'gfsib01bkp.corvidtec.com' failed: Connection refused, retrying
 Jan 26 09:27:53 gfs01bkp mount[4456]: mount to NFS server
 'gfsib01bkp.corvidtec.com' failed: Connection refused, retrying
 Jan 26 09:29:28 gfs01bkp mount[2810]: mount to NFS server
 'gfsib01bkp.corvidtec.com' failed: Connection refused, retrying
 Jan 26 09:29:53 gfs01bkp mount[4456]: mount to NFS server
 'gfsib01bkp.corvidtec.com' failed: Connection refused, retrying
 Jan 26 09:31:28 gfs01bkp mount[2810]: mount to NFS server
 'gfsib01bkp.corvidtec.com' failed: Connection refused, retrying
 Jan 26 09:31:53 gfs01bkp mount[4456]: mount to NFS server
 'gfsib01bkp.corvidtec.com' failed: Connection refused, retrying
 Jan 26 09:33:28 gfs01bkp mount[2810]: mount to NFS server
 'gfsib01bkp.corvidtec.com' failed: Connection refused, retrying
 Jan 26 09:33:53 gfs01bkp mount[4456]: mount to NFS server
 'gfsib01bkp.corvidtec.com' failed: Connection refused, retrying
 Jan 26 09:35:28 gfs01bkp mount[2810]: mount to NFS server
 'gfsib01bkp.corvidtec.com' failed: Connection refused, retrying
 Jan 26 09:35:53 gfs01bkp mount[4456]: mount to NFS server
 'gfsib01bkp.corvidtec.com' failed: Connection refused, retrying
 I also am continually getting the following errors in
 /var/log/glusterfs:
   [root@gfs01bkp glusterfs]# tail -f etc-glusterfs-glusterd.vol.log
 [2015-01-26 14:41:51.260827] W [socket.c:611:__socket_rwv]
 0-management: readv on
 /var/run/1f0cee5a2d074e39b32ee5a81c70e68c.socket failed (Invalid
 argument)
 [2015-01-26 14:41:54.261240] W [socket.c:611:__socket_rwv]
 0-management: readv on
 /var/run/1f0cee5a2d074e39b32ee5a81c70e68c.socket failed (Invalid
 argument)
 [2015-01-26 14:41:57.261642] W [socket.c:611:__socket_rwv]
 0-management: readv on
 /var/run/1f0cee5a2d074e39b32ee5a81c70e68c.socket failed (Invalid
 argument)
 [2015-01-26 14:42:00.262073] W [socket.c:611:__socket_rwv]
 0-management: readv on
 /var/run/1f0cee5a2d074e39b32ee5a81c70e68c.socket failed (Invalid
 argument)
 [2015-01-26 14:42:03.262504] W [socket.c:611:__socket_rwv]
 0-management: readv on
 /var/run/1f0cee5a2d074e39b32ee5a81c70e68c.socket failed (Invalid
 argument)
 [2015-01-26 14:42:06.262935] W [socket.c:611:__socket_rwv]
 0-management: readv on
 /var/run/1f0cee5a2d074e39b32ee5a81c70e68c.socket failed (Invalid
 argument)
 [2015-01-26 14:42:09.263334] W [socket.c:611:__socket_rwv]
 0-management: readv on
 /var/run/1f0cee5a2d074e39b32ee5a81c70e68c.socket failed (Invalid
 argument)
 [2015-01-26 14:42:12.263761] W [socket.c:611:__socket_rwv]
 0-management: readv on
 /var/run/1f0cee5a2d074e39b32ee5a81c70e68c.socket failed (Invalid
 argument)
 [2015-01-26 14:42:15.264177] W [socket.c:611:__socket_rwv]
 0-management: readv on
 /var/run/1f0cee5a2d074e39b32ee5a81c70e68c.socket failed (Invalid
 argument)
 [2015-01-26 14:42:18.264623] W [socket.c:611:__socket_rwv]
 0-management: readv on
 /var/run/1f0cee5a2d074e39b32ee5a81c70e68c.socket failed (Invalid
 argument)
 [2015-01-26 14:42:21.265053] W [socket.c:611:__socket_rwv]
 0-management: readv on
 /var/run/1f0cee5a2d074e39b32ee5a81c70e68c.socket failed (Invalid
 argument)
 [2015-01-26 14:42:24.265504] W [socket.c:611:__socket_rwv]
 0-management: readv on
 /var/run/1f0cee5a2d074e39b32ee5a81c70e68c.socket failed (Invalid
 argument)
 I believe this error message comes when the socket file is not 
present.

 I see the following commit which changed the location

Re: [Gluster-devel] 3.6.1 issue

2014-12-22 Thread David F. Robinson
That did not fix the issue (see below).  I have also run into another 
possibly related issue.  After untarring the boost directory and 
compiling the software, I cannot delete the source directory structure.  
It says the directory is not empty.


corvidpost5:temp3/gfs \rm -r boost_1_57_0
rm: cannot remove `boost_1_57_0/libs/numeric/odeint/test': Directory not 
empty

corvidpost5:temp3/gfs cd boost_1_57_0/libs/numeric/odeint/test/
corvidpost5:odeint/test ls -al
total 0
drwxr-x--- 3 dfrobins users  94 Dec 20 01:51 ./
drwx------ 3 dfrobins users 100 Dec 20 01:51 ../



Results after setting cluster.read-hash-mode to 2:

corvidpost5:TankExamples/DakotaList ls -al
total 5
drwxr-x--- 2 dfrobins users  166 Dec 22 11:16 ./
drwxr-x--- 6 dfrobins users  445 Dec 22 11:16 ../
lrwxrwxrwx 1 dfrobins users   25 Dec 22 11:16 EvalTank.py -> ../tank_model/EvalTank.py*
---------- 1 dfrobins users    0 Dec 22 11:16 FEMTank.py
-rwx--x--- 1 dfrobins users  734 Nov  7 11:05 RunTank.sh*
-rw------- 1 dfrobins users 1432 Nov  7 11:05 dakota_PandL_list.in
-rw------- 1 dfrobins users 1860 Nov  7 11:05 dakota_Ponly_list.in

gluster volume info homegfs

Volume Name: homegfs
Type: Distributed-Replicate
Volume ID: 1e32672a-f1b7-4b58-ba94-58c085e59071
Status: Started
Number of Bricks: 4 x 2 = 8
Transport-type: tcp
Bricks:
Brick1: gfsib01a.corvidtec.com:/data/brick01a/homegfs
Brick2: gfsib01b.corvidtec.com:/data/brick01b/homegfs
Brick3: gfsib01a.corvidtec.com:/data/brick02a/homegfs
Brick4: gfsib01b.corvidtec.com:/data/brick02b/homegfs
Brick5: gfsib02a.corvidtec.com:/data/brick01a/homegfs
Brick6: gfsib02b.corvidtec.com:/data/brick01b/homegfs
Brick7: gfsib02a.corvidtec.com:/data/brick02a/homegfs
Brick8: gfsib02b.corvidtec.com:/data/brick02b/homegfs
Options Reconfigured:
cluster.read-hash-mode: 2
performance.stat-prefetch: off
performance.io-thread-count: 32
performance.cache-size: 128MB
performance.write-behind-window-size: 128MB
server.allow-insecure: on
network.ping-timeout: 10
storage.owner-gid: 100
geo-replication.indexing: off
geo-replication.ignore-pid-check: on
changelog.changelog: on
changelog.fsync-interval: 3
changelog.rollover-time: 15
server.manage-gids: on



-- Original Message --
From: Vijay Bellur vbel...@redhat.com
To: David F. Robinson david.robin...@corvidtec.com
Cc: Justin Clift jus...@gluster.org; Gluster Devel 
gluster-devel@gluster.org

Sent: 12/22/2014 9:23:44 AM
Subject: Re: [Gluster-devel] 3.6.1 issue


On 12/21/2014 11:10 PM, David F. Robinson wrote:
So for now it is up to all of the individual users to know they cannot 
use tar without the -P switch if they are accessing a data storage 
system that uses gluster?




Setting volume option cluster.read-hash-mode to 2 could help here. Can 
you please check if this resolves the problem without -P switch?


-Vijay
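
For anyone following along, applying the suggested option is a one-liner; the volume name homegfs is taken from the volume info quoted elsewhere in this thread:

gluster volume set homegfs cluster.read-hash-mode 2
gluster volume info homegfs | grep read-hash-mode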


On Dec 21, 2014, at 12:30 PM, Vijay Bellur vbel...@redhat.com 
wrote:



On 12/20/2014 12:09 PM, David F. Robinson wrote:
Seems to work with -xPf. I obviously couldn't check all of the 
files,
but the two specific ones that I noted in my original email do not 
show

any problems when using -P...


This is related to the way tar extracts symbolic links by default and 
its interaction with GlusterFS. In a nutshell, the following steps are 
involved in the creation of symbolic links on the destination:


a) Create an empty regular placeholder file with permission bits set 
to 0 and the name being that of the symlink source file.


b) Record the device, inode numbers and the mtime of the placeholder 
file through stat.


c) After the first pass of extraction is complete, there is a second 
pass involved to set right symbolic links. In this phase a stat is 
performed on the placeholder file. If all attributes recorded in b) 
are in sync with the latest information from stat buf, only then the 
placeholder is unlinked and a new symbolic link is created. If any 
attribute is out of sync, the unlink and creation of symbolic link do 
not happen.


In the case of replicated GlusterFS volumes, the mtimes can vary 
across nodes during the creation of placeholder files. If the stat 
calls in steps b) and c) land on different nodes, then there is a 
very good likelihood that tar would skip creation of symbolic links 
and leave behind the placeholder files.


A little more detail about this particular implementation behavior of 
symlinks for tar can be found at [1].


To overcome this behavior, we can make use of the -P switch with the tar 
command during extraction, which will create the link file directly 
and not go ahead with the above set of steps.


Keeping timestamps in sync across the cluster will help to an extent 
in preventing this situation. There are ongoing refinements in 
replicate's selection of read-child which will help in addressing 
this problem.


-Vijay

[1] http://lists.debian.org/debian-user/2003/03/msg03249.html
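
Given the mechanism described above, leftover placeholders are easy to spot from the client side: they are the empty, permission-bits-0 regular files from step a). A sketch against the boost tree mentioned in this thread (the find expression is illustrative, not from the original mails):

# placeholders left behind by a non -P extraction: empty regular files with mode 0000
find boost_1_57_0 -type f -size 0 -perm 000 -ls
# extracting with -P creates the symlinks directly and avoids the placeholder pass
tar -xPf dakota-6.1-public.src_.tar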




___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel


Re: [Gluster-devel] 3.6.1 issue

2014-12-21 Thread David F. Robinson
So for now it is up to all of the individual users to know they cannot use tar 
without the -P switch if they are accessing a data storage system that uses 
gluster? 

David  (Sent from mobile)

===
David F. Robinson, Ph.D. 
President - Corvid Technologies
704.799.6944 x101 [office]
704.252.1310  [cell]
704.799.7974  [fax]
david.robin...@corvidtec.com
http://www.corvidtechnologies.com

 On Dec 21, 2014, at 12:30 PM, Vijay Bellur vbel...@redhat.com wrote:
 
 On 12/20/2014 12:09 PM, David F. Robinson wrote:
 Seems to work with -xPf.  I obviously couldn't check all of the files,
 but the two specific ones that I noted in my original email do not show
 any problems when using -P...
 
 This is related to the way tar extracts symbolic links by default and its 
 interaction with GlusterFS. In a nutshell, the following steps are involved in 
 the creation of symbolic links on the destination:
 
 a) Create an empty regular placeholder file with permission bits set to 0 and 
 the name being that of the symlink source file.
 
 b) Record the device, inode numbers and the mtime of the placeholder file 
 through stat.
 
 c) After the first pass of extraction is complete, there is a second pass 
 involved to set right symbolic links. In this phase a stat is performed on 
 the placeholder file. If all attributes recorded in b) are in sync with the 
 latest information from stat buf, only then the placeholder is unlinked and a 
 new symbolic link is created. If any attribute is out of sync, the unlink and 
 creation of symbolic link do not happen.
 
 In the case of replicated GlusterFS volumes, the mtimes can vary across nodes 
 during the creation of placeholder files. If the stat calls in steps b) and 
 c) land on different nodes, then there is a very good likelihood that tar 
 would skip creation of symbolic links and leave behind the placeholder files.
 
 A little more detail about this particular implementation behavior of 
 symlinks for tar can be found at [1].
 
 To overcome this behavior, we can make use of the -P switch with the tar command 
 during extraction, which will create the link file directly and not go ahead 
 with the above set of steps.
 
 Keeping timestamps in sync across the cluster will help to an extent in 
 preventing this situation. There are ongoing refinements in replicate's 
 selection of read-child which will help in addressing this problem.
 
 -Vijay
 
 [1] http://lists.debian.org/debian-user/2003/03/msg03249.html
 
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel


Re: [Gluster-devel] 3.6.1 issue

2014-12-19 Thread David F. Robinson
Seems to work with -xPf.  I obviously couldn't check all of the files, 
but the two specific ones that I noted in my original email do not show 
any problems when using -P...


David


-- Original Message --
From: Vijay Bellur vbel...@redhat.com
To: David F. Robinson david.robin...@corvidtec.com; Justin Clift 
jus...@gluster.org; Gluster Devel gluster-devel@gluster.org

Sent: 12/20/2014 1:04:57 AM
Subject: Re: [Gluster-devel] 3.6.1 issue


On 12/16/2014 10:59 PM, David F. Robinson wrote:

Gluster 3.6.1 seems to be having an issue creating symbolic links. To
reproduce this issue, I downloaded the file
dakota-6.1-public.src_.tar.gz from
https://dakota.sandia.gov/download.html
# gunzip dakota-6.1-public.src_.tar.gz
# tar -xf dakota-6.1-public.src_.tar


Can you please try with tar -xPf ... and check the results?

Thanks,
Vijay



___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel


Re: [Gluster-devel] [Gluster-users] glusterfs-3.5.3beta1 has been released for testing

2014-10-06 Thread David F. Robinson
When I installed the 3.5.3beta on my HPC cluster, I get the following 
warnings during the mounts:


WARNING: getfattr not found, certain checks will be skipped..
I do not have attr installed on my compute nodes.  Is this something 
that I need in order for gluster to work properly or can this safely be 
ignored?
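
For what it's worth, getfattr comes from the attr package on RHEL/Scientific Linux, and the warning is printed by the mount script when it cannot run its extended-attribute checks; installing the package on the compute nodes is the simple fix (package name is my assumption from the EL6 layout):

yum install -y attr
which getfattr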


David





-- Original Message --
From: Niels de Vos nde...@redhat.com
To: gluster-us...@gluster.org; gluster-devel@gluster.org
Sent: 10/5/2014 8:44:59 AM
Subject: [Gluster-users] glusterfs-3.5.3beta1 has been released for 
testing



GlusterFS 3.5.3 (beta1) has been released and is now available for
testing. Get the tarball from here:
- 
http://bits.gluster.org/pub/gluster/glusterfs/src/glusterfs-3.5.3beta1.tar.gz


Packages for different distributions will land on the download server
over the next few days. When packages become available, the package
maintainers will send a notification to this list.

With this beta release, we make it possible for bug reporters and
testers to check if issues have indeed been fixed. All community 
members

are invited to test and/or comment on this release.

This release for the 3.5 stable series includes the following bug 
fixes:

- 1081016: glusterd needs xfsprogs and e2fsprogs packages
- 1129527: DHT :- data loss - file is missing on renaming same file 
from multiple client at same time
- 1129541: [DHT:REBALANCE]: Rebalance failures are seen with error 
message  remote operation failed: File exists
- 1132391: NFS interoperability problem: stripe-xlator removes EOF at 
end of READDIR

- 1133949: Minor typo in afr logging
- 1136221: The memories are exhausted quickly when handle the message 
which has multi fragments in a single record

- 1136835: crash on fsync
- 1138922: DHT + rebalance : rebalance process crashed + data loss + 
few Directories are present on sub-volumes but not visible on mount 
point + lookup is not healing directories
- 1139103: DHT + Snapshot :- If snapshot is taken when Directory is 
created only on hashed sub-vol; On restoring that snapshot Directory is 
not listed on mount point and lookup on parent is not healing
- 1139170: DHT :- rm -rf is not removing stale link file and because of 
that unable to create file having same name as stale link file
- 1139245: vdsm invoked oom-killer during rebalance and Killed process 
4305, UID 0, (glusterfs nfs process)
- 1140338: rebalance is not resulting in the hash layout changes being 
available to nfs client
- 1140348: Renaming file while rebalance is in progress causes data 
loss
- 1140549: DHT: Rebalance process crash after add-brick and `rebalance 
start' operation
- 1140556: Core: client crash while doing rename operations on the 
mount
- 1141558: AFR : gluster volume heal volume_name info prints some 
random characters
- 1141733: data loss when rebalance + renames are in progress and 
bricks from replica pairs goes down and comes back

- 1142052: Very high memory usage during rebalance
- 1142614: files with open fd's getting into split-brain when bricks 
goes offline and comes back online

- 1144315: core: all brick processes crash when quota is enabled
- 1145000: Spec %post server does not wait for the old glusterd to exit
- 1147243: nfs: volume set help says the rmtab file is in 
/var/lib/glusterd/rmtab


To get more information about the above bugs, go to
https://bugzilla.redhat.com, enter the bug number in the search box and
press enter.

If a bug from this list has not been sufficiently fixed, please open 
the

bug report, leave a comment with details of the testing and change the
status of the bug to ASSIGNED.

In case someone has successfully verified a fix for a bug, please 
change

the status of the bug to VERIFIED.

The release notes have been posted for review, and a blog post contains
an easier readable version:
- http://review.gluster.org/8903
- 
http://blog.nixpanic.net/2014/10/glusterfs-353beta1-has-been-released.html


Comments in bug reports, over email or on IRC (#gluster on Freenode) 
are

much appreciated.

Thanks for testing,
Niels

___
Gluster-users mailing list
gluster-us...@gluster.org
http://supercolony.gluster.org/mailman/listinfo/gluster-users


___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://supercolony.gluster.org/mailman/listinfo/gluster-devel


Re: [Gluster-devel] Fw: Re: Corvid gluster testing

2014-08-07 Thread David F. Robinson
Just to clarify a little, there are two cases where I was evaluating 
performance.


1) The first case that Pranith was working involved 20-nodes with 
4-processors on each node for a total of 80-processors.  Each processor 
does its own independent i/o.  These files are roughly 100-200MB each 
and there are several hundred of them.  When I mounted the gluster 
system using fuse, it took 1.5-hours to do the i/o.  When I mounted the 
same system using NFS, it took 30-minutes.  Note, that in order to get 
the gluster mounted file-system down to 1.5-hours, I had to get rid of 
the replicated volume (this was done during troubleshooting with Pranith 
to rule out other possible issues).  The timing was significantly worse 
(3+ hours) when I was using a replicated pair.
2) The second case was the output of a larger single file (roughly 
2.5TB).  For this case, it takes the gluster mounted filesystem 
60-seconds (although I got that down to 52-seconds with some gluster 
parameter tuning).  The NFS mount takes 38-seconds.  I sent the results 
of this to the developer list first as this case is much easier to test 
(50-seconds versus what could be 3+ hours).


I am headed out of town for a few days and will not be able to do 
additional testing until Monday.  For the second case, I will turn off 
cluster.eager-lock and send the results to the email list. If there is 
any other testing that you would like to see for the first case, let me 
know and I will be happy to perform the tests and send in the results...


Sorry for the confusion...

David


-- Original Message --
From: Pranith Kumar Karampuri pkara...@redhat.com
To: Anand Avati av...@gluster.org
Cc: David F. Robinson david.robin...@corvidtec.com; Gluster Devel 
gluster-devel@gluster.org

Sent: 8/6/2014 9:51:11 PM
Subject: Re: [Gluster-devel] Fw: Re: Corvid gluster testing



On 08/07/2014 07:18 AM, Anand Avati wrote:
It would be worth checking the perf numbers without -o acl (in case it 
was enabled, as seen in the other gid thread). Client side -o acl 
mount option can have a negative impact on performance because of the 
increased number of up-calls from FUSE for access().
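
A quick way to confirm whether -o acl is actually in effect on the clients, assuming the /homegfs fstab entries quoted later in this thread and that the mount script passes --acl through to the fuse client (my understanding, not something stated here):

grep acl /etc/fstab
# the fuse client process carries --acl on its command line when the option is active
ps ax | grep '[g]lusterfs' | grep -- --acl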

Actually it is all write intensive.
Here are the numbers they gave me from earlier runs:
 %-latency   Avg-latency   Min-Latency   Max-Latency   No. of calls         Fop
 ---------   -----------   -----------   -----------   ------------        ----
      0.00       0.00 us       0.00 us       0.00 us             99      FORGET
      0.00       0.00 us       0.00 us       0.00 us           1093     RELEASE
      0.00       0.00 us       0.00 us       0.00 us            468  RELEASEDIR
      0.00      60.00 us      26.00 us     107.00 us              4     SETATTR
      0.00      91.56 us      42.00 us     157.00 us             27      UNLINK
      0.00      20.75 us      12.00 us      55.00 us            132    GETXATTR
      0.00      19.03 us       9.00 us      95.00 us            152    READLINK
      0.00      43.19 us      12.00 us     106.00 us             83        OPEN
      0.00      18.37 us       8.00 us      92.00 us            257      STATFS
      0.00      32.42 us      11.00 us     118.00 us            322     OPENDIR
      0.00      36.09 us       5.00 us     109.00 us            359       FSTAT
      0.00      51.14 us      37.00 us     183.00 us            663      RENAME
      0.00      33.32 us       6.00 us     123.00 us           1451        STAT
      0.00     821.79 us      21.00 us   22678.00 us             84        READ
      0.00      34.88 us       3.00 us     139.00 us           2326       FLUSH
      0.01     789.33 us      72.00 us   64054.00 us            347      CREATE
      0.01    1144.63 us      43.00 us  280735.00 us            337   FTRUNCATE
      0.01      47.82 us      16.00 us   19817.00 us          16513      LOOKUP
      0.02     604.85 us      11.00 us    1233.00 us           1423    READDIRP
     99.95      17.51 us       6.00 us  212701.00 us      300715967       WRITE

    Duration: 5390 seconds
   Data Read: 1495257497 bytes
Data Written: 166546887668 bytes

Pranith


Thanks


On Wed, Aug 6, 2014 at 6:26 PM, Pranith Kumar Karampuri 
pkara...@redhat.com wrote:


On 08/07/2014 06:48 AM, Anand Avati wrote:




On Wed, Aug 6, 2014 at 6:05 PM, Pranith Kumar Karampuri 
pkara...@redhat.com wrote:
We checked this performance with plain distribute as well and on 
nfs it gave 25 minutes where as on nfs it gave around 90 minutes 
after disabling throttling in both situations.


This sentence is very confusing. Can you please state it more 
clearly?

sorry :-D.
We checked this performance on plain distribute volume by disabling 
throttling.

On nfs the run took 25 minutes.
On fuse the run took 90 minutes.

Pranith



Thanks


I was wondering if any of you guys know what could contribute to 
this difference.


Pranith

On 08/07/2014 01:33 AM, Anand Avati wrote:
Seems like heavy FINODELK contention. As a diagnostic step

Re: [Gluster-devel] Fw: Re: Corvid gluster testing

2014-08-06 Thread David F. Robinson

Forgot to attach profile info in previous email.  Attached...

David


-- Original Message --
From: David F. Robinson david.robin...@corvidtec.com
To: gluster-devel@gluster.org
Sent: 8/5/2014 2:41:34 PM
Subject: Fw: Re: Corvid gluster testing

I have been testing some of the fixes that Pranith incorporated into 
the 3.5.2-beta to see how they performed for moderate levels of i/o. 
All of the stability issues that I had seen in previous versions seem 
to have been fixed in 3.5.2; however, there still seem to be some 
significant performance issues.  Pranith suggested that I send this to 
the gluster-devel email list, so here goes:


I am running an MPI job that saves a restart file to the gluster file 
system.  When I use the following in my fstab to mount the gluster 
volume, the i/o time for the 2.5GB file is roughly 45-seconds.


gfsib01a.corvidtec.com:/homegfs /homegfs glusterfs 
transport=tcp,_netdev 0 0
When I switch this to use the NFS protocol (see below), the i/o time is 
2.5-seconds.


  gfsib01a.corvidtec.com:/homegfs /homegfs nfs 
vers=3,intr,bg,rsize=32768,wsize=32768 0 0


The read-times for gluster are 10-20% faster than NFS, but the write 
times are almost 20x slower.
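
For anyone wanting to reproduce the gap outside the MPI job, a single-stream write of roughly the same size is a reasonable stand-in (illustrative only; 2560 MiB approximates the 2.5GB restart file mentioned above):

# run once on the fuse mount and once on the nfs mount of the same volume, then compare
dd if=/dev/zero of=/homegfs/ddtest.bin bs=1M count=2560 conv=fsync
rm -f /homegfs/ddtest.bin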


I am running SL 6.4 and glusterfs-3.5.2-0.1.beta1.el6.x86_64...

[root@gfs01a glusterfs]# gluster volume info homegfs
Volume Name: homegfs
Type: Distributed-Replicate
Volume ID: 1e32672a-f1b7-4b58-ba94-58c085e59071
Status: Started
Number of Bricks: 2 x 2 = 4
Transport-type: tcp
Bricks:
Brick1: gfsib01a.corvidtec.com:/data/brick01a/homegfs
Brick2: gfsib01b.corvidtec.com:/data/brick01b/homegfs
Brick3: gfsib01a.corvidtec.com:/data/brick02a/homegfs
Brick4: gfsib01b.corvidtec.com:/data/brick02b/homegfs

David

-- Forwarded Message --
From: Pranith Kumar Karampuri pkara...@redhat.com
To: David Robinson david.robin...@corvidtec.com
Cc: Young Thomas tom.yo...@corvidtec.com
Sent: 8/5/2014 2:25:38 AM
Subject: Re: Corvid gluster testing

gluster-devel@gluster.org is the email-id for the mailing list. We 
should probably start with the initial run numbers and the comparison 
for glusterfs mount and nfs mounts. May be something like


glusterfs mount: 90 minutes
nfs mount: 25 minutes

And profile outputs, volume config, number of mounts, hardware 
configuration should be a good start.


Pranith

On 08/05/2014 09:28 AM, David Robinson wrote:

Thanks pranith


===
David F. Robinson, Ph.D.
President - Corvid Technologies
704.799.6944 x101 [office]
704.252.1310 [cell]
704.799.7974 [fax]
david.robin...@corvidtec.com
http://www.corvidtechnologies.com

On Aug 4, 2014, at 11:22 PM, Pranith Kumar Karampuri 
pkara...@redhat.com wrote:




On 08/05/2014 08:33 AM, Pranith Kumar Karampuri wrote:

On 08/05/2014 08:29 AM, David F. Robinson wrote:

On 08/05/2014 12:51 AM, David F. Robinson wrote:
No. I don't want to use nfs. It eliminates most of the benefits 
of why I want to use gluster. Failover redundancy of the pair, 
load balancing, etc.
What is the meaning of 'Failover redundancy of the pair, load 
balancing'? Could you elaborate more? smb/nfs/glusterfs are just 
access protocols that gluster supports; the functionality is almost 
the same.

Here is my understanding. Please correct me where I am wrong.

With gluster, if I am doing a write and one of the replicated pairs 
goes down, there is no interruption to the I/o. The failover is 
handled by gluster and the fuse client. This isn't done if I use an 
nfs mount unless the component of the pair that goes down isn't the 
one I used for the mount.


With nfs, I will have to mount one of the bricks. So, if I have 
gfs01a, gfs01b, gfs02a, gfs02b, gfs03a, gfs03b, etc and my fstab 
mounts gfs01a, it is my understanding that all of my I/o will go 
through gfs01a which then gets distributed to all of the other 
bricks. Gfs01a throughput becomes a bottleneck, whereas if I do a 
gluster mount using fuse, the load balancing is handled at the 
client side, not the server side. If I have 1000-nodes accessing 
20-gluster bricks, I need the load balancing aspect. I cannot have 
all traffic going through the network interface on a single brick.


If I am wrong with the above assumptions, I guess my question is 
why would one ever use the gluster mount instead of nfs and/or 
samba?


Tom: feel free to chime in if I have missed anything.
I see your point now. Yes the gluster server where you did the mount 
is kind of a bottle neck.
Now that we established the problem is in the clients/protocols, you 
should send out a detailed mail on gluster-devel and see if anyone 
can help with you on performance xlators that can improve it a bit 
more. My area of expertise is more on replication. I am 
sub-maintainer for replication,locks components. I also know 
connection management/io-threads related issues which lead to hangs 
as I worked on them before. Performance xlators are black box to me.


Performance xlators are enabled only on fuse gluster stack. On nfs 
server mounts we

[Gluster-devel] Fw: Re: Corvid gluster testing

2014-08-06 Thread David F. Robinson
I have been testing some of the fixes that Pranith incorporated into the 
3.5.2-beta to see how they performed for moderate levels of i/o. All of 
the stability issues that I had seen in previous versions seem to have 
been fixed in 3.5.2; however, there still seem to be some significant 
performance issues.  Pranith suggested that I send this to the 
gluster-devel email list, so here goes:


I am running an MPI job that saves a restart file to the gluster file 
system.  When I use the following in my fstab to mount the gluster 
volume, the i/o time for the 2.5GB file is roughly 45-seconds.


gfsib01a.corvidtec.com:/homegfs /homegfs glusterfs 
transport=tcp,_netdev 0 0
When I switch this to use the NFS protocol (see below), the i/o time is 
2.5-seconds.


  gfsib01a.corvidtec.com:/homegfs /homegfs nfs 
vers=3,intr,bg,rsize=32768,wsize=32768 0 0


The read-times for gluster are 10-20% faster than NFS, but the write 
times are almost 20x slower.


I am running SL 6.4 and glusterfs-3.5.2-0.1.beta1.el6.x86_64...

[root@gfs01a glusterfs]# gluster volume info homegfs
Volume Name: homegfs
Type: Distributed-Replicate
Volume ID: 1e32672a-f1b7-4b58-ba94-58c085e59071
Status: Started
Number of Bricks: 2 x 2 = 4
Transport-type: tcp
Bricks:
Brick1: gfsib01a.corvidtec.com:/data/brick01a/homegfs
Brick2: gfsib01b.corvidtec.com:/data/brick01b/homegfs
Brick3: gfsib01a.corvidtec.com:/data/brick02a/homegfs
Brick4: gfsib01b.corvidtec.com:/data/brick02b/homegfs

David

-- Forwarded Message --
From: Pranith Kumar Karampuri pkara...@redhat.com
To: David Robinson david.robin...@corvidtec.com
Cc: Young Thomas tom.yo...@corvidtec.com
Sent: 8/5/2014 2:25:38 AM
Subject: Re: Corvid gluster testing

gluster-devel@gluster.org is the email-id for the mailing list. We 
should probably start with the initial run numbers and the comparison 
for glusterfs mount and nfs mounts. May be something like


glusterfs mount: 90 minutes
nfs mount: 25 minutes

And profile outputs, volume config, number of mounts, hardware 
configuration should be a good start.


Pranith

On 08/05/2014 09:28 AM, David Robinson wrote:

Thanks pranith


===
David F. Robinson, Ph.D.
President - Corvid Technologies
704.799.6944 x101 [office]
704.252.1310 [cell]
704.799.7974 [fax]
david.robin...@corvidtec.com
http://www.corvidtechnologies.com

On Aug 4, 2014, at 11:22 PM, Pranith Kumar Karampuri 
pkara...@redhat.com wrote:




On 08/05/2014 08:33 AM, Pranith Kumar Karampuri wrote:

On 08/05/2014 08:29 AM, David F. Robinson wrote:

On 08/05/2014 12:51 AM, David F. Robinson wrote:
No. I don't want to use nfs. It eliminates most of the benefits of 
why I want to use gluster. Failover redundancy of the pair, load 
balancing, etc.
What is the meaning of 'Failover redundancy of the pair, load 
balancing'? Could you elaborate more? smb/nfs/glusterfs are just 
access protocols that gluster supports; the functionality is almost the same.

Here is my understanding. Please correct me where I am wrong.

With gluster, if I am doing a write and one of the replicated pairs 
goes down, there is no interruption to the I/o. The failover is 
handled by gluster and the fuse client. This isn't done if I use an 
nfs mount unless the component of the pair that goes down isn't the 
one I used for the mount.


With nfs, I will have to mount one of the bricks. So, if I have 
gfs01a, gfs01b, gfs02a, gfs02b, gfs03a, gfs03b, etc and my fstab 
mounts gfs01a, it is my understanding that all of my I/o will go 
through gfs01a which then gets distributed to all of the other 
bricks. Gfs01a throughput becomes a bottleneck, whereas if I do a 
gluster mount using fuse, the load balancing is handled at the 
client side, not the server side. If I have 1000-nodes accessing 
20-gluster bricks, I need the load balancing aspect. I cannot have 
all traffic going through the network interface on a single brick.


If I am wrong with the above assumptions, I guess my question is why 
would one ever use the gluster mount instead of nfs and/or samba?


Tom: feel free to chime in if I have missed anything.
I see your point now. Yes the gluster server where you did the mount 
is kind of a bottle neck.
Now that we established the problem is in the clients/protocols, you 
should send out a detailed mail on gluster-devel and see if anyone can 
help with you on performance xlators that can improve it a bit more. 
My area of expertise is more on replication. I am sub-maintainer for 
replication,locks components. I also know connection 
management/io-threads related issues which lead to hangs as I worked 
on them before. Performance xlators are black box to me.


Performance xlators are enabled only on fuse gluster stack. On nfs 
server mounts we disable all the performance xlators except 
write-behind as nfs client does lots of things for improving 
performance. I suggest you guys follow up more on gluster-devel.
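
As a follow-up experiment on the fuse side, the performance translators can be tuned per volume with gluster volume set; a sketch reusing option names already present in the homegfs configuration quoted earlier in these threads (values are illustrative, not recommendations):

gluster volume set homegfs performance.write-behind-window-size 128MB
gluster volume set homegfs performance.io-thread-count 32
gluster volume set homegfs performance.stat-prefetch off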


Appreciate all the help you did for improving the product :-). Thanks 
a ton!

Pranith