Re: [Gluster-devel] Regression-test-burn-in crash in EC test

2016-04-29 Thread Atin Mukherjee
-Atin
Sent from one plus one
On 29-Apr-2016 9:36 PM, "Ashish Pandey"  wrote:
>
>
> Hi Jeff,
>
> Where can we find the core dump?
>
> ---
> Ashish
>
> 
> From: "Pranith Kumar Karampuri" 
> To: "Jeff Darcy" 
> Cc: "Gluster Devel" , "Ashish Pandey" <
aspan...@redhat.com>
> Sent: Thursday, April 28, 2016 11:58:54 AM
> Subject: Re: [Gluster-devel] Regression-test-burn-in crash in EC test
>
>
> Ashish,
>Could you take a look at this?
>
> Pranith
>
> - Original Message -
> > From: "Jeff Darcy" 
> > To: "Gluster Devel" 
> > Sent: Wednesday, April 27, 2016 11:31:25 PM
> > Subject: [Gluster-devel] Regression-test-burn-in crash in EC test
> >
> > One of the "rewards" of reviewing and merging people's patches is
getting
> > email if the next regression-test-burn-in should fail - even if it
fails for
> > a completely unrelated reason.  Today I got one that's not among the
usual
> > suspects.  The failure was a core dump in
tests/bugs/disperse/bug-1304988.t,
> > weighing in at a respectable 42 frames.
> >
> > #0  0x7fef25976cb9 in dht_rename_lock_cbk
> > #1  0x7fef25955f62 in dht_inodelk_done
> > #2  0x7fef25957352 in dht_blocking_inodelk_cbk
> > #3  0x7fef32e02f8f in default_inodelk_cbk
> > #4  0x7fef25c029a3 in ec_manager_inodelk
> > #5  0x7fef25bf9802 in __ec_manager
> > #6  0x7fef25bf990c in ec_manager
> > #7  0x7fef25c03038 in ec_inodelk
> > #8  0x7fef25bee7ad in ec_gf_inodelk
> > #9  0x7fef25957758 in dht_blocking_inodelk_rec
> > #10 0x7fef25957b2d in dht_blocking_inodelk
> > #11 0x7fef2597713f in dht_rename_lock
> > #12 0x7fef25977835 in dht_rename
> > #13 0x7fef32e0f032 in default_rename
> > #14 0x7fef32e0f032 in default_rename
> > #15 0x7fef32e0f032 in default_rename
> > #16 0x7fef32e0f032 in default_rename
> > #17 0x7fef32e0f032 in default_rename
> > #18 0x7fef32e07c29 in default_rename_resume
> > #19 0x7fef32d8ed40 in call_resume_wind
> > #20 0x7fef32d98b2f in call_resume
> > #21 0x7fef24cfc568 in open_and_resume
> > #22 0x7fef24cffb99 in ob_rename
> > #23 0x7fef24aee482 in mdc_rename
> > #24 0x7fef248d68e5 in io_stats_rename
> > #25 0x7fef32e0f032 in default_rename
> > #26 0x7fef2ab1b2b9 in fuse_rename_resume
> > #27 0x7fef2ab12c47 in fuse_fop_resume
> > #28 0x7fef2ab107cc in fuse_resolve_done
> > #29 0x7fef2ab108a2 in fuse_resolve_all
> > #30 0x7fef2ab10900 in fuse_resolve_continue
> > #31 0x7fef2ab0fb7c in fuse_resolve_parent
> > #32 0x7fef2ab1077d in fuse_resolve
> > #33 0x7fef2ab10879 in fuse_resolve_all
> > #34 0x7fef2ab10900 in fuse_resolve_continue
> > #35 0x7fef2ab0fb7c in fuse_resolve_parent
> > #36 0x7fef2ab1077d in fuse_resolve
> > #37 0x7fef2ab10824 in fuse_resolve_all
> > #38 0x7fef2ab1093e in fuse_resolve_and_resume
> > #39 0x7fef2ab1b40e in fuse_rename
> > #40 0x7fef2ab2a96a in fuse_thread_proc
> > #41 0x7fef3204daa1 in start_thread
> >
> > In other words we started at FUSE, went through a bunch of performance
> > translators, through DHT to EC, and then crashed on the way back.  It
> > seems a little odd that we turn the fop around immediately in EC, and
> > that we have default_inodelk_cbk at frame 3.  Could one of the DHT or
> > EC people please take a look at it?  Thanks!
> >
> >
> > https://build.gluster.org/job/regression-test-burn-in/868/console
This is the one.
> > ___
> > Gluster-devel mailing list
> > Gluster-devel@gluster.org
> > http://www.gluster.org/mailman/listinfo/gluster-devel
> >
>
>

[Gluster-devel] [IMPORTANT] Adding release notes for 3.8 features

2016-04-29 Thread Jiffin Tony Thottan

 Hi all,

The branching for 3.8 will happen on April 30th, 2016. Since we are 
approaching the last stage of the 3.8 release, a public pad [1] has been 
created for adding release notes. It should mention major changes that 
may impact the overall working of a feature. For example, we are planning 
to deprecate Gluster NFS in 3.8, i.e. when a volume gets started, the NFS 
server won't start by default. The user needs to turn off the 
"nfs.disable" option to bring up Gluster NFS.

So I kindly request all the feature owners to update the release notes 
for their feature.
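For reference, re-enabling Gluster NFS on a volume after this change would 
look roughly like the following (the volume name "gv0" is only an example; 
this is a sketch, not something run here):

```shell
# Turn the nfs.disable option off so the Gluster NFS server starts
# again when the volume is (re)started; "gv0" is a placeholder name.
gluster volume set gv0 nfs.disable off
gluster volume get gv0 nfs.disable   # verify the option took effect
```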


Also please update your progress on 3.8 features in the roadmap [2]

[1] https://public.pad.fsfe.org/p/glusterfs-3.8-release-notes
[2] https://www.gluster.org/community/roadmap/3.8/

Thanks,
Niels & Jiffin





[Gluster-devel] .glusterfs grown larger than volume content

2016-04-29 Thread Vincent Huynh
Hello,

We've noticed that the .glusterfs directory is larger than the contents of the 
volume. Our application only has access through the client so I don't suspect 
anything was deleted on the brick.

# du -sh .glusterfs
31G .glusterfs/
# du -sh *
13G dir1
31M dir2

How could we have come into this state? Is there a way to find what is orphaned?

We tried looking for any references to deleted files but it didn't seem to 
yield much:
# find .glusterfs -links 1 -ls
2211630 lrwxrwxrwx   1 root root   51 Mar  4 14:34 
.glusterfs/91/ff/91ffa-f20f-4933-a8d6-abx93074 -> 
../../00/00/----0001/dir2
3835150 lrwxrwxrwx   1 root root   59 Mar  4 15:08 
.glusterfs/b1/2d/bd5b5-e00c-4bd1-95c6-312a25 -> 
../../7e/85/7cxxx90-88e9-4cdd-95fd-dd48/recyclebin
4494050 lrwxrwxrwx   1 root root   51 Mar  4 15:08 
.glusterfs/21/28/2102-101e-4177-b775-74379ba -> 
../../00/00/----0001/dir2
3941500 lrwxrwxrwx   1 root root   59 Apr  4 13:24 
.glusterfs/c7/2b/c728-877-49a-b7d-3b3149c -> 
../../e1/09/e10xx94e-c5xcd-4c1f-95f-4824106e/recyclebin
2299340 lrwxrwxrwx   1 root root   60 Mar  4 15:08 
.glusterfs/00/00/----0006 -> 
../../00/00/----0005/internal_op
2129310 lrwxrwxrwx   1 root root 8 Mar  4 15:08 
.glusterfs/00/00/----0001 -> ../../..
4775410 lrwxrwxrwx   1 root root   58 Mar  4 15:08 
.glusterfs/00/00/----0005 -> 
../../00/00/----0001/.trashcan
3850480 lrwxrwxrwx   1 root root   55 Mar 23 12:02 
.glusterfs/b3/21/b3xxb20-4b23-4e93-8db4-3dxx8x6e -> 
../../e1/09/e10084e-c5cd-4c1f-95f-482106e/videos
2199364 -rw-r--r--   1 root root19 Apr 27 10:54 
.glusterfs/health_check
2640270 --   1 root root0 Apr 26 13:01 
.glusterfs/indices/xattrop/xattrop-2198-d683-431-bxx2-103474
2129410 lrwxrwxrwx   1 root root   51 Mar  4 14:24 
.glusterfs/e1/09/e1xxx4e-c5d-4c1f-95f-482xe -> 
../../00/00/----0001/dir1
3976650 lrwxrwxrwx   1 root root   51 Mar  4 15:08 
.glusterfs/7e/85/757c90-8e9-4cdd-95fd-dd48 -> 
../../00/00/----0001/dir1
270337   20 -rw-r--r--   1 root root20480 Dec 14 23:03 
.glusterfs/data.db
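For context on the `find -links 1` approach: for every regular file the 
brick keeps a hard link under .glusterfs, addressed by the first two 
byte-pairs of the file's GFID (directories get a symlink instead, which is 
why they always show one link). A small sketch of that mapping (the GFID 
value here is made up):

```shell
# Compute the .glusterfs backend path for a given GFID
# (hypothetical GFID; real ones come from the trusted.gfid xattr).
gfid="91ffa0f2-f20f-4933-a8d6-ab1293074abc"
p1=$(printf %s "$gfid" | cut -c1-2)
p2=$(printf %s "$gfid" | cut -c3-4)
path=".glusterfs/$p1/$p2/$gfid"
echo "$path"   # .glusterfs/91/ff/91ffa0f2-f20f-4933-a8d6-ab1293074abc
```

An entry under .glusterfs with a link count of 1 that is a regular file 
(not a symlink) is a candidate orphan, since the original file it linked 
to no longer exists on the brick.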

We are running on a single node, but when I added a second node and 
performed a full heal, the .glusterfs directory size was the same as the 
volume content size, which is what we expected.

Version: glusterfs 3.7.3
OS: CentOS 5

Any advice would be much appreciated!

Thanks!
Vincent

Re: [Gluster-devel] Regression-test-burn-in crash in EC test

2016-04-29 Thread Ashish Pandey

Hi Jeff, 

Where can we find the core dump? 

--- 
Ashish 

- Original Message -

From: "Pranith Kumar Karampuri"  
To: "Jeff Darcy"  
Cc: "Gluster Devel" , "Ashish Pandey" 
 
Sent: Thursday, April 28, 2016 11:58:54 AM 
Subject: Re: [Gluster-devel] Regression-test-burn-in crash in EC test 

Ashish, 
Could you take a look at this? 

Pranith 

- Original Message - 
> From: "Jeff Darcy"  
> To: "Gluster Devel"  
> Sent: Wednesday, April 27, 2016 11:31:25 PM 
> Subject: [Gluster-devel] Regression-test-burn-in crash in EC test 
> 
> One of the "rewards" of reviewing and merging people's patches is getting 
> email if the next regression-test-burn-in should fail - even if it fails for 
> a completely unrelated reason. Today I got one that's not among the usual 
> suspects. The failure was a core dump in tests/bugs/disperse/bug-1304988.t, 
> weighing in at a respectable 42 frames. 
> 
> #0 0x7fef25976cb9 in dht_rename_lock_cbk 
> #1 0x7fef25955f62 in dht_inodelk_done 
> #2 0x7fef25957352 in dht_blocking_inodelk_cbk 
> #3 0x7fef32e02f8f in default_inodelk_cbk 
> #4 0x7fef25c029a3 in ec_manager_inodelk 
> #5 0x7fef25bf9802 in __ec_manager 
> #6 0x7fef25bf990c in ec_manager 
> #7 0x7fef25c03038 in ec_inodelk 
> #8 0x7fef25bee7ad in ec_gf_inodelk 
> #9 0x7fef25957758 in dht_blocking_inodelk_rec 
> #10 0x7fef25957b2d in dht_blocking_inodelk 
> #11 0x7fef2597713f in dht_rename_lock 
> #12 0x7fef25977835 in dht_rename 
> #13 0x7fef32e0f032 in default_rename 
> #14 0x7fef32e0f032 in default_rename 
> #15 0x7fef32e0f032 in default_rename 
> #16 0x7fef32e0f032 in default_rename 
> #17 0x7fef32e0f032 in default_rename 
> #18 0x7fef32e07c29 in default_rename_resume 
> #19 0x7fef32d8ed40 in call_resume_wind 
> #20 0x7fef32d98b2f in call_resume 
> #21 0x7fef24cfc568 in open_and_resume 
> #22 0x7fef24cffb99 in ob_rename 
> #23 0x7fef24aee482 in mdc_rename 
> #24 0x7fef248d68e5 in io_stats_rename 
> #25 0x7fef32e0f032 in default_rename 
> #26 0x7fef2ab1b2b9 in fuse_rename_resume 
> #27 0x7fef2ab12c47 in fuse_fop_resume 
> #28 0x7fef2ab107cc in fuse_resolve_done 
> #29 0x7fef2ab108a2 in fuse_resolve_all 
> #30 0x7fef2ab10900 in fuse_resolve_continue 
> #31 0x7fef2ab0fb7c in fuse_resolve_parent 
> #32 0x7fef2ab1077d in fuse_resolve 
> #33 0x7fef2ab10879 in fuse_resolve_all 
> #34 0x7fef2ab10900 in fuse_resolve_continue 
> #35 0x7fef2ab0fb7c in fuse_resolve_parent 
> #36 0x7fef2ab1077d in fuse_resolve 
> #37 0x7fef2ab10824 in fuse_resolve_all 
> #38 0x7fef2ab1093e in fuse_resolve_and_resume 
> #39 0x7fef2ab1b40e in fuse_rename 
> #40 0x7fef2ab2a96a in fuse_thread_proc 
> #41 0x7fef3204daa1 in start_thread 
> 
> In other words we started at FUSE, went through a bunch of performance 
> translators, through DHT to EC, and then crashed on the way back. It seems 
> a little odd that we turn the fop around immediately in EC, and that we have 
> default_inodelk_cbk at frame 3. Could one of the DHT or EC people please 
> take a look at it? Thanks! 
> 
> 
> https://build.gluster.org/job/regression-test-burn-in/868/console 


[Gluster-devel] Gluster Download link

2016-04-29 Thread Shakti Rathore
Hi,



I am unable to install GlusterFS-Server from the installation instruction
provided here:



http://www.gluster.org/community/documentation/index.php/Getting_started_install



I am using Centos 7 and when I enter the command:



wget -P /etc/yum.repos.d
http://download.gluster.org/pub/gluster/glusterfs/LATEST/RHEL/glusterfs-epel.repo



I get the message that it was not found.



I manually searched for the repo link, as it seems to have changed, and
after adding it, running "yum install glusterfs" fails again, saying that
the index file was not found.



Something is wrong. Please look at the links and make the required
corrections so I can install the GlusterFS server.



Thank you.



Shakti Rathore

Re: [Gluster-devel] netbsd smoke failure

2016-04-29 Thread Michael Scherer
Le vendredi 29 avril 2016 à 10:05 -0400, Susant Palai a écrit :
> Hi All,
>   On many of my patches the following error is seen from netbsd smoke test.
> 
> Triggered by Gerrit: http://review.gluster.org/13993
> Building remotely on netbsd0.cloud.gluster.org (netbsd_build) in workspace 
> /home/jenkins/root/workspace/netbsd6-smoke
>  > git rev-parse --is-inside-work-tree # timeout=10
> Fetching changes from the remote Git repository
>  > git config remote.origin.url git://review.gluster.org/glusterfs.git # 
> timeout=10
> ERROR: Error fetching remote repo 'origin'
> hudson.plugins.git.GitException: Failed to fetch from 
> git://review.gluster.org/glusterfs.git
>   at hudson.plugins.git.GitSCM.fetchFrom(GitSCM.java:810)
>   at hudson.plugins.git.GitSCM.retrieveChanges(GitSCM.java:1066)
>   at hudson.plugins.git.GitSCM.checkout(GitSCM.java:1097)
>   at hudson.scm.SCM.checkout(SCM.java:485)
>   at hudson.model.AbstractProject.checkout(AbstractProject.java:1269)
>   at 
> hudson.model.AbstractBuild$AbstractBuildExecution.defaultCheckout(AbstractBuild.java:607)
>   at jenkins.scm.SCMCheckoutStrategy.checkout(SCMCheckoutStrategy.java:86)
>   at 
> hudson.model.AbstractBuild$AbstractBuildExecution.run(AbstractBuild.java:529)
>   at hudson.model.Run.execute(Run.java:1738)
>   at hudson.model.FreeStyleBuild.run(FreeStyleBuild.java:43)
>   at hudson.model.ResourceController.execute(ResourceController.java:98)
>   at hudson.model.Executor.run(Executor.java:410)
> Caused by: hudson.plugins.git.GitException: Command "git config 
> remote.origin.url git://review.gluster.org/glusterfs.git" returned status 
> code 255:
> stdout: 
> stderr: error: could not lock config file .git/config: File exists
> 
> 
> 
> Please let me know how this can be resolved.

As I didn't have access, I ran a Groovy script to add my key as root.
Then I rebooted the server (since some old processes were running) and
removed the file causing the trouble (.git/config.lock in the repo).

I will re-enable the builder soon.
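The failure mode and the fix can be reproduced in a scratch repository; 
the lock file name is the one git itself uses, everything else below is 
illustrative:

```shell
# Simulate a stale lock left behind by an interrupted git process,
# then clear it -- the same cleanup done on the builder.
repo=$(mktemp -d)
git init -q "$repo"
touch "$repo/.git/config.lock"                 # leftover lock
git -C "$repo" config x.y z 2>/dev/null && locked=no || locked=yes
rm -f "$repo/.git/config.lock"                 # the actual fix
git -C "$repo" config x.y z && fixed=yes || fixed=no
echo "locked=$locked fixed=$fixed"             # locked=yes fixed=yes
rm -rf "$repo"
```

Before deleting the lock on a real builder, check that no git process is 
still running, since a live process may legitimately hold it.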
-- 
Michael Scherer
Sysadmin, Community Infrastructure and Platform, OSAS





Re: [Gluster-devel] [Gluster-users] gluster 3.7.9 permission denied and mv errors

2016-04-29 Thread Glomski, Patrick
Raghavendra,

This error is occurring in a shell script moving files between directories
on a FUSE mount when overwriting an old file with a newer file (it's a
backup script, moving an incremental backup of a file into a 'rolling full
backup' directory).

As a temporary workaround, we parse the output of this shell script for
move errors and handle the errors as they happen. Simply re-moving the
files fails, so we stat the destination (to see if we can learn anything
about the type of file that causes this behavior), delete the destination,
and try the move again (success!). Typical output is as follows:

/bin/mv: cannot move
`./homegfs/hpc_shared/motorsports/gmics/Raven/p11/149/data_collected4'
> to `../bkp00/./homegfs/hpc_shared/motorsports/gmics/
> Raven/p11/149/data_collected4': File exists
> /bin/mv: cannot move 
> `./homegfs/hpc_shared/motorsports/gmics/Raven/p11/149/data_collected4'
> to `../bkp00/./homegfs/hpc_shared/motorsports/gmics/
> Raven/p11/149/data_collected4': File exists
>   File: `../bkp00/./homegfs/hpc_shared/motorsports/gmics/
> Raven/p11/149/data_collected4'
>   Size: 1714       Blocks: 4          IO Block: 131072   regular file
> Device: 13h/19d Inode: 11051758947722304158  Links: 1
> Access: (0660/-rw-rw)  Uid: (  628/pkeistler)   Gid: ( 2020/   gmirl)
> Access: 2016-01-20 17:20:45.0 -0500
> Modify: 2015-11-06 15:20:41.0 -0500
> Change: 2016-01-27 03:35:00.434712146 -0500
> retry: renaming 
> ./homegfs/hpc_shared/motorsports/gmics/Raven/p11/149/data_collected4
> -> ../bkp00/./homegfs/hpc_shared/motorsports/gmics/Raven/p11/
> 149/data_collected4
>

Not sure if that description rings any bells as to what the problem might
be, but if not, I added some code to print out the 'getattr' for the source
and destination file on all of the bricks (before we delete the
destination) and will post to this thread the next time we have that issue.
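The retry workaround described above boils down to something like this 
(names and paths are illustrative; on a healthy mount the fallback branch 
is never taken, and here a scratch directory stands in for the FUSE mount):

```shell
# Move with a delete-and-retry fallback for the spurious
# "File exists" failures described above.
tmp=$(mktemp -d); cd "$tmp"
mkdir bkp00
echo new > data_collected4; echo old > bkp00/data_collected4
if ! mv -f data_collected4 bkp00/data_collected4 2>/dev/null; then
    stat bkp00/data_collected4    # record what the stuck target looks like
    rm -f bkp00/data_collected4   # drop the destination and retry
    mv -f data_collected4 bkp00/data_collected4
fi
cat bkp00/data_collected4         # prints "new"
```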

Thanks,
Patrick


On Fri, Apr 29, 2016 at 8:15 AM, Raghavendra G 
wrote:

>
>
> On Wed, Apr 13, 2016 at 10:00 PM, David F. Robinson <
> david.robin...@corvidtec.com> wrote:
>
>> I am running into two problems (possibly related?).
>>
>> 1) Every once in a while, when I do a 'rm -rf DIRNAME', it comes back
>> with an error:
>> rm: cannot remove `DIRNAME` : Directory not empty
>>
>> If I try the 'rm -rf' again after the error, it deletes the
>> directory.  The issue is that I have scripts that clean up directories, and
>> they are failing unless I go through the deletes a 2nd time.
>>
>
> What kind of mount are you using? Is it a FUSE or NFS mount? Recently we
> saw a similar issue on NFS clients on RHEL6 where rm -rf used to fail with
> ENOTEMPTY in some specific cases.
>
>
>>
>> 2) I have different scripts to move a large numbers of files (5-25k) from
>> one directory to another.  Sometimes I receive an error:
>> /bin/mv: cannot move `xyz` to `../bkp00/xyz`: File exists
>>
>
> Does ./bkp00/xyz exist on backend? If yes, what is the value of gfid xattr
> (key: "trusted.gfid") for "xyz" and "./bkp00/xyz" on backend bricks (I need
> gfid from all the bricks) when this issue happens?
>
>
>> The move is done using '/bin/mv -f', so it should overwrite the file
>> if it exists.  I have tested this with hundreds of files, and it works as
>> expected.  However, every few days the script that moves the files will
>> have problems with 1 or 2 files during the move.  This is one move problem
>> out of roughly 10,000 files that are being moved and I cannot figure out
>> any reason for the intermittent problem.
>>
>> Setup details for my gluster configuration shown below.
>>
>> [root@gfs01bkp logs]# gluster volume info
>>
>> Volume Name: gfsbackup
>> Type: Distribute
>> Volume ID: e78d5123-d9bc-4d88-9c73-61d28abf0b41
>> Status: Started
>> Number of Bricks: 7
>> Transport-type: tcp
>> Bricks:
>> Brick1: gfsib01bkp.corvidtec.com:/data/brick01bkp/gfsbackup
>> Brick2: gfsib01bkp.corvidtec.com:/data/brick02bkp/gfsbackup
>> Brick3: gfsib02bkp.corvidtec.com:/data/brick01bkp/gfsbackup
>> Brick4: gfsib02bkp.corvidtec.com:/data/brick02bkp/gfsbackup
>> Brick5: gfsib02bkp.corvidtec.com:/data/brick03bkp/gfsbackup
>> Brick6: gfsib02bkp.corvidtec.com:/data/brick04bkp/gfsbackup
>> Brick7: gfsib02bkp.corvidtec.com:/data/brick05bkp/gfsbackup
>> Options Reconfigured:
>> nfs.disable: off
>> server.allow-insecure: on
>> storage.owner-gid: 100
>> server.manage-gids: on
>> cluster.lookup-optimize: on
>> server.event-threads: 8
>> client.event-threads: 8
>> changelog.changelog: off
>> storage.build-pgfid: on
>> performance.readdir-ahead: on
>> diagnostics.brick-log-level: WARNING
>> diagnostics.client-log-level: WARNING
>> cluster.rebal-throttle: aggressive
>> performance.cache-size: 1024MB
>> performance.write-behind-window-size: 10MB
>>
>>
>> [root@gfs01bkp logs]# rpm -qa | grep gluster
>> glusterfs-server-3.7.9-1.el6.x86_64
>> glusterfs-debuginfo-3.7.9-1.el6.x86_64
>> glusterfs-api-3.7.9-1.el6.x86_64
>> 

[Gluster-devel] netbsd smoke failure

2016-04-29 Thread Susant Palai
Hi All,
  On many of my patches the following error is seen from netbsd smoke test.

Triggered by Gerrit: http://review.gluster.org/13993
Building remotely on netbsd0.cloud.gluster.org (netbsd_build) in workspace 
/home/jenkins/root/workspace/netbsd6-smoke
 > git rev-parse --is-inside-work-tree # timeout=10
Fetching changes from the remote Git repository
 > git config remote.origin.url git://review.gluster.org/glusterfs.git # 
 > timeout=10
ERROR: Error fetching remote repo 'origin'
hudson.plugins.git.GitException: Failed to fetch from 
git://review.gluster.org/glusterfs.git
at hudson.plugins.git.GitSCM.fetchFrom(GitSCM.java:810)
at hudson.plugins.git.GitSCM.retrieveChanges(GitSCM.java:1066)
at hudson.plugins.git.GitSCM.checkout(GitSCM.java:1097)
at hudson.scm.SCM.checkout(SCM.java:485)
at hudson.model.AbstractProject.checkout(AbstractProject.java:1269)
at 
hudson.model.AbstractBuild$AbstractBuildExecution.defaultCheckout(AbstractBuild.java:607)
at jenkins.scm.SCMCheckoutStrategy.checkout(SCMCheckoutStrategy.java:86)
at 
hudson.model.AbstractBuild$AbstractBuildExecution.run(AbstractBuild.java:529)
at hudson.model.Run.execute(Run.java:1738)
at hudson.model.FreeStyleBuild.run(FreeStyleBuild.java:43)
at hudson.model.ResourceController.execute(ResourceController.java:98)
at hudson.model.Executor.run(Executor.java:410)
Caused by: hudson.plugins.git.GitException: Command "git config 
remote.origin.url git://review.gluster.org/glusterfs.git" returned status code 
255:
stdout: 
stderr: error: could not lock config file .git/config: File exists



Please let me know how this can be resolved.

Here are a few links to the netbsd logs:
https://build.gluster.org/job/netbsd6-smoke/13136/console
https://build.gluster.org/job/netbsd6-smoke/13137/console

Thanks,
Susant


Re: [Gluster-devel] Possible bug in the communications layer ?

2016-04-29 Thread Xavier Hernandez

With your patch applied, it seems that the bug is not hit.

I guess it's a timing issue that the new logging hides. Maybe there is no 
more data available after reading the partial readv header? (It will 
arrive later.)


I'll continue testing...

Xavi

On 29/04/16 13:48, Raghavendra Gowdappa wrote:

Attaching the patch.

- Original Message -

From: "Raghavendra Gowdappa" 
To: "Xavier Hernandez" 
Cc: "Gluster Devel" 
Sent: Friday, April 29, 2016 5:14:02 PM
Subject: Re: [Gluster-devel] Possible bug in the communications layer ?



- Original Message -

From: "Xavier Hernandez" 
To: "Raghavendra Gowdappa" 
Cc: "Gluster Devel" 
Sent: Friday, April 29, 2016 1:21:57 PM
Subject: Re: [Gluster-devel] Possible bug in the communications layer ?

Hi Raghavendra,

yes, the readv response contains xdata. The dict length is 38 (0x26)
and, at the moment of failure, rsp.xdata.xdata_len already contains 0x26.


rsp.xdata.xdata_len having 0x26 even when decoding failed indicates that the
approach used in socket.c to get the length of xdata is correct. However, I
cannot find any way for xdata to end up in the payload vector other than
xdata_len being zero. Just to be doubly sure, I have a patch that prints a
debug message with xdata_len when decoding fails in socket.c. Can you please
apply the patch, run the tests, and report back with the results?



Xavi

On 29/04/16 09:10, Raghavendra Gowdappa wrote:



- Original Message -

From: "Raghavendra Gowdappa" 
To: "Xavier Hernandez" 
Cc: "Gluster Devel" 
Sent: Friday, April 29, 2016 12:36:43 PM
Subject: Re: [Gluster-devel] Possible bug in the communications layer ?



- Original Message -

From: "Raghavendra Gowdappa" 
To: "Xavier Hernandez" 
Cc: "Jeff Darcy" , "Gluster Devel"

Sent: Friday, April 29, 2016 12:07:59 PM
Subject: Re: [Gluster-devel] Possible bug in the communications layer ?



- Original Message -

From: "Xavier Hernandez" 
To: "Jeff Darcy" 
Cc: "Gluster Devel" 
Sent: Thursday, April 28, 2016 8:15:36 PM
Subject: Re: [Gluster-devel] Possible bug in the communications layer
?



Hi Jeff,

On 28.04.2016 15:20, Jeff Darcy wrote:



This happens with Gluster 3.7.11 accessed through Ganesha and gfapi. The
volume is a distributed-disperse 4*(4+2). I'm able to reproduce the
problem easily doing the following test:

iozone -t2 -s10g -r1024k -i0 -w -F/iozone{1..2}.dat
echo 3 >/proc/sys/vm/drop_caches
iozone -t2 -s10g -r1024k -i1 -w -F/iozone{1..2}.dat

The error happens soon after starting the read test. As can be seen in
the data below, client3_3_readv_cbk() is processing an iovec of 116
bytes, however it should be of 154 bytes (the buffer in memory really
seems to contain 154 bytes). The data on the network seems ok (at least
I haven't been able to identify any problem), so this must be a
processing error on the client side. The last field in the cut buffer of
the serialized data corresponds to the length of the xdata field: 0x26.
So at least 38 more bytes should be present.

Nice detective work, Xavi.  It would be *very* interesting to see what
the value of the "count" parameter is (it's unfortunately optimized
out).  I'll bet it's two, and iov[1].iov_len is 38.  I have a weak
memory of some problems with how this iov is put together, a couple of
years ago, and it looks like you might have tripped over one more.

It seems you are right. The count is 2 and the first 38 bytes of the
second vector contain the remaining data of the xdata field.


This is the bug. client3_3_readv_cbk (and for that matter all the
actors/cbks) expects the response in at most two vectors:
1. The program header containing the request or response. This is subject
to encoding/decoding. This vector should point to a buffer that contains
the entire program header/response contiguously.
2. If the procedure returns a payload (like a readv response or a write
request), the second vector contains the buffer pointing to the entire
(contiguous) payload. Note that this payload is raw and is not subject to
encoding/decoding.

In your case, this _clean_ separation is broken, with part of the program
header slipping into the 2nd vector that is supposed to contain only the
read data (maybe because of RPC fragmentation). I think this is a bug in
the socket layer. I'll update more on this.
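With the numbers from this thread (154-byte header, 116 bytes in the first
vector, 0x26 = 38 header bytes leaking into the second), the reassembly a
workaround would have to do looks roughly like this (byte contents are
made up):

```shell
# Rebuild a 154-byte program header whose last 38 bytes (the xdata
# portion) were fragmented into the start of the payload vector.
vec1=$(printf 'H%.0s' $(seq 1 116))            # truncated header vector
payload="$(printf 'X%.0s' $(seq 1 38))PAYLOAD" # 38 header bytes + real data
missing=$((154 - ${#vec1}))                    # 38 bytes still needed
full_header="$vec1$(printf %s "$payload" | cut -c1-"$missing")"
data=$(printf %s "$payload" | cut -c$((missing + 1))-)
echo "header=${#full_header} bytes, payload starts with: $data"
```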


Does your read response include xdata too? I think the code related to
reading xdata in readv response is a bit murky.



case SP_STATE_ACCEPTED_SUCCESS_REPLY_INIT:
        default_read_size = xdr_sizeof ((xdrproc_t) xdr_gfs3_read_rsp,
                                        _rsp);

        proghdr_buf = frag->fragcurrent;


Re: [Gluster-devel] How to enable ACL support in Glusterfs volume

2016-04-29 Thread ABHISHEK PALIWAL
Hi Niels,

Now I am able to run 'setfacl' command on Gluster volume using Kenrel NFS.

The problem was that I was exporting the Gluster volume mount point, which
I get after mounting the Gluster volume, in the /etc/exports file:

like
mount -t glusterfs -o acl :/ 


But instead of using this mount point I need to export the volume brick
path, i.e. '/tmp/brick/gv0':

gluster volume info

Volume Name: gv0
Type: Distribute
Volume ID: c3d636aa-f718-47b2-90eb-2b5846ad52a2
Status: Started
Number of Bricks: 1
Transport-type: tcp
Bricks:
Brick1: 128.224.95.140:/tmp/brick/gv0
Options Reconfigured:
nfs.disable: on
performance.readdir-ahead: on

and mount on remote as below:

mount -t nfs -o acl,vers=3 128.224.95.140:/tmp/brick/gv0 

then run

setfacl -m u:nobody:rw /
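For anyone who still wants to re-export the FUSE mount point (the first
attempt above) rather than the brick: kernel NFS needs an explicit fsid in
/etc/exports, because FUSE filesystems have no stable device number. A
sketch (the path and fsid value are examples, not from the thread):

```shell
# /etc/exports -- re-exporting a GlusterFS FUSE mount over kernel NFS
/mnt/c  *(rw,fsid=14,no_subtree_check)
```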

Now I have one question here: please confirm, are there any side effects
to exporting the brick path directly?


Regards,
Abhishek


On Thu, Apr 28, 2016 at 5:35 PM, ABHISHEK PALIWAL 
wrote:

>
>
> On Thu, Apr 28, 2016 at 4:13 PM, Niels de Vos  wrote:
>
>> On Thu, Apr 28, 2016 at 12:05:37PM +0530, ABHISHEK PALIWAL wrote:
>> > Hi,
>> >
>> > I have one more query:
>> >
>> > I am using machine with ip 10.32.0.48 where gluster is running and
>> mounted
>> > my gluster volume as follows
>> >
>> > mount -t glusterfs -o acl 10.32.0.48:/c_glusterfs /mnt/c
>> >
>> > and after that I mounted /mnt/c volume to /tmp/l on same machin
>> 10.32.0.48
>> >
>> > mount -t nfs -o acl,vers=3 10.32.0.48:/mnt/c /tmp/l
>> >
>> > When I run setfacl command on /tmp/l (mounted as nfs) volume its not
>> > working
>> > # setfacl -m u:application:r /tmp/l/usr
>> > setfacl: /tmp/l/usr: Operation not supported
>> >
>> > but when I run setfacl command on /mnt/c(mounted as glusterfs) it is
>> > working
>> > # setfacl -m u:application:r /mnt/c
>> >
>> > Could you please tell me the reason for this.
>>
>> Note that NFSv3 ACLs are not part of the NFS protocol itself. It is
>> handled by a side-band protocol. If all ACL operations on any NFS server
>> fail, make sure to check that the ports for NFSv3 ACLs are open. You can
>> chech that with 'rpcinfo -p $NFS_SERVER'.
>>
>
> ACL operation working fine with other NFS servers and ports are also open
>
>
>
>>
>> Gluster/NFS should have ACLs enabled by default. It is possible to
>> disable support for ACLs in Gluster/NFS with the 'nfs.acl' volume
>> option, just make sure that the option is not set, or is set to 'true'.
>>
>
> I have tried with the Gluster/NFS options
>
> nfs.disable: off
> nfs.acl: on
>
> but the setfacl command still fails.
>
>>
>> HTH,
>> Niels
>>
>>
>> >
>> > Regards,
>> > Abhishek
>> >
>> > On Wed, Apr 27, 2016 at 4:56 PM, Niels de Vos 
>> wrote:
>> >
>> > > Thank you for your email.
>> > >
>> > > I am out of the office on 27-April-2016 and will return on
>> 28-April-2016.
>> > > While I am out I will have limited access to email. When I have
>> returned, I
>> > > will respond to your message as soon as possible.
>> > >
>> > > Many thanks,
>> > > Niels de Vos
>> > >
>> >
>> >
>> >
>> > --
>> >
>> >
>> >
>> >
>> > Regards
>> > Abhishek Paliwal
>>
>
>
>
> --
>
>
>
>
> Regards
> Abhishek Paliwal
>



-- 




Regards
Abhishek Paliwal

Re: [Gluster-devel] gluster 3.7.9 permission denied and mv errors

2016-04-29 Thread Raghavendra G
On Wed, Apr 13, 2016 at 10:00 PM, David F. Robinson <
david.robin...@corvidtec.com> wrote:

> I am running into two problems (possibly related?).
>
> 1) Every once in a while, when I do a 'rm -rf DIRNAME', it comes back with
> an error:
> rm: cannot remove `DIRNAME` : Directory not empty
>
> If I try the 'rm -rf' again after the error, it deletes the
> directory.  The issue is that I have scripts that clean up directories, and
> they are failing unless I go through the deletes a 2nd time.
>

What kind of mount are you using? Is it a FUSE or NFS mount? Recently we
saw a similar issue on NFS clients on RHEL6 where rm -rf used to fail with
ENOTEMPTY in some specific cases.


>
> 2) I have different scripts to move a large numbers of files (5-25k) from
> one directory to another.  Sometimes I receive an error:
> /bin/mv: cannot move `xyz` to `../bkp00/xyz`: File exists
>

Does ./bkp00/xyz exist on backend? If yes, what is the value of gfid xattr
(key: "trusted.gfid") for "xyz" and "./bkp00/xyz" on backend bricks (I need
gfid from all the bricks) when this issue happens?


> The move is done using '/bin/mv -f', so it should overwrite the file
> if it exists.  I have tested this with hundreds of files, and it works as
> expected.  However, every few days the script that moves the files will
> have problems with 1 or 2 files during the move.  This is one move problem
> out of roughly 10,000 files that are being moved and I cannot figure out
> any reason for the intermittent problem.
>
> Setup details for my gluster configuration shown below.
>
> [root@gfs01bkp logs]# gluster volume info
>
> Volume Name: gfsbackup
> Type: Distribute
> Volume ID: e78d5123-d9bc-4d88-9c73-61d28abf0b41
> Status: Started
> Number of Bricks: 7
> Transport-type: tcp
> Bricks:
> Brick1: gfsib01bkp.corvidtec.com:/data/brick01bkp/gfsbackup
> Brick2: gfsib01bkp.corvidtec.com:/data/brick02bkp/gfsbackup
> Brick3: gfsib02bkp.corvidtec.com:/data/brick01bkp/gfsbackup
> Brick4: gfsib02bkp.corvidtec.com:/data/brick02bkp/gfsbackup
> Brick5: gfsib02bkp.corvidtec.com:/data/brick03bkp/gfsbackup
> Brick6: gfsib02bkp.corvidtec.com:/data/brick04bkp/gfsbackup
> Brick7: gfsib02bkp.corvidtec.com:/data/brick05bkp/gfsbackup
> Options Reconfigured:
> nfs.disable: off
> server.allow-insecure: on
> storage.owner-gid: 100
> server.manage-gids: on
> cluster.lookup-optimize: on
> server.event-threads: 8
> client.event-threads: 8
> changelog.changelog: off
> storage.build-pgfid: on
> performance.readdir-ahead: on
> diagnostics.brick-log-level: WARNING
> diagnostics.client-log-level: WARNING
> cluster.rebal-throttle: aggressive
> performance.cache-size: 1024MB
> performance.write-behind-window-size: 10MB
>
>
> [root@gfs01bkp logs]# rpm -qa | grep gluster
> glusterfs-server-3.7.9-1.el6.x86_64
> glusterfs-debuginfo-3.7.9-1.el6.x86_64
> glusterfs-api-3.7.9-1.el6.x86_64
> glusterfs-resource-agents-3.7.9-1.el6.noarch
> gluster-nagios-common-0.1.1-0.el6.noarch
> glusterfs-libs-3.7.9-1.el6.x86_64
> glusterfs-fuse-3.7.9-1.el6.x86_64
> glusterfs-extra-xlators-3.7.9-1.el6.x86_64
> glusterfs-geo-replication-3.7.9-1.el6.x86_64
> glusterfs-3.7.9-1.el6.x86_64
> glusterfs-cli-3.7.9-1.el6.x86_64
> glusterfs-devel-3.7.9-1.el6.x86_64
> glusterfs-rdma-3.7.9-1.el6.x86_64
> samba-vfs-glusterfs-4.1.11-2.el6.x86_64
> glusterfs-client-xlators-3.7.9-1.el6.x86_64
> glusterfs-api-devel-3.7.9-1.el6.x86_64
> python-gluster-3.7.9-1.el6.noarch
>
>
> ___
> Gluster-devel mailing list
> Gluster-devel@gluster.org
> http://www.gluster.org/mailman/listinfo/gluster-devel
>



-- 
Raghavendra G
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel

Re: [Gluster-devel] Gluster + Infiniband + 3.x kernel -> hard crash?

2016-04-29 Thread Raghavendra G
On Thu, Apr 7, 2016 at 2:02 AM, Glomski, Patrick <
patrick.glom...@corvidtec.com> wrote:

> We run gluster 3.7 in a distributed replicated setup. Infiniband (tcp)
> links the gluster peers together and clients use the ethernet interface.
>
> This setup is stable running CentOS 6.x and using the most recent
> infiniband drivers provided by Mellanox. Uptime was 170 days when we took
> it down to wipe the systems and update to CentOS 7.
>
> When the exact same setup is loaded onto a CentOS 7 machine (minor setup
> differences, but basically the same; setup is handled by ansible), the
> peers will (seemingly randomly) experience a hard crash and need to be
> power-cycled. There is no output on the screen and nothing in the logs.
> After rebooting, the peer reconnects, heals whatever files it missed, and
> everything is happy again. Maximum uptime for any given peer is 20 days.
> Thanks to the replication, clients maintain connectivity, but from a system
> administration perspective it's driving me crazy!
>
> We run other storage servers with the same infiniband and CentOS7 setup
> except that they use NFS instead of gluster. NFS shares are served through
> infiniband to some machines and ethernet to others.
>
> Is it possible that gluster's (and only gluster's) use of the infiniband
> kernel module to send tcp packets to its peers on a 3 kernel is causing the
> system to have a hard crash?
>

Please note that Gluster is only a "userspace" consumer of infiniband. So,
at least in "theory" it shouldn't result in a kernel panic. However,
infiniband also allows userspace programs to do some things that can
normally be done only by the kernel (like pinning pages at a specific
address). I am not very familiar with the internals of infiniband and hence
cannot authoritatively comment on whether a kernel panic is
possible/impossible. Someone with an understanding of infiniband internals
would be in a better position to comment on this.


> Pretty specific problem and it doesn't make much sense to me, but that's
> sure where the evidence seems to point.
>
> Anyone running CentOS 7 gluster arrays with infiniband out there to
> confirm that it works fine for them? Gluster devs care to chime in with a
> better theory? I'd love for this random crashing to stop.
>
> Thanks,
> Patrick
>
> ___
> Gluster-devel mailing list
> Gluster-devel@gluster.org
> http://www.gluster.org/mailman/listinfo/gluster-devel
>



-- 
Raghavendra G
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel

Re: [Gluster-devel] Regression-test-burn-in crash in EC test

2016-04-29 Thread Raghavendra G
Seems like I missed adding rtalur/sakshi to cc list.

On Fri, Apr 29, 2016 at 5:25 PM, Raghavendra G 
wrote:

> Raghavendra Talur reported another crash in dht_rename_lock_cbk (which is
> similar - not exactly the same - to the backtrace presented here). I heard Sakshi is
> taking a look into this.
>
> Rtalur/Sakshi,
>
> Can you please post your findings here?
>
> regards,
> Raghavendra
>
> On Fri, Apr 29, 2016 at 4:50 PM, Jeff Darcy  wrote:
>
>> > The test is doing renames where source and target directories are
>> > different. At the same time a new ec-set is added and rebalance started.
>> > Rebalance will cause dht to also move files between bricks. Maybe this
>> > is causing some race in dht ?
>> >
>> > I'll try to continue investigating when I have some time.
>>
>> That would be great, but if you've pursued this as far as DHT then it
>> would be OK to hand it off to that team as well.  Thanks!
>> ___
>> Gluster-devel mailing list
>> Gluster-devel@gluster.org
>> http://www.gluster.org/mailman/listinfo/gluster-devel
>>
>
>
>
> --
> Raghavendra G
>



-- 
Raghavendra G
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel

Re: [Gluster-devel] Possible bug in the communications layer ?

2016-04-29 Thread Raghavendra Gowdappa
Attaching the patch.

- Original Message -
> From: "Raghavendra Gowdappa" 
> To: "Xavier Hernandez" 
> Cc: "Gluster Devel" 
> Sent: Friday, April 29, 2016 5:14:02 PM
> Subject: Re: [Gluster-devel] Possible bug in the communications layer ?
> 
> 
> 
> - Original Message -
> > From: "Xavier Hernandez" 
> > To: "Raghavendra Gowdappa" 
> > Cc: "Gluster Devel" 
> > Sent: Friday, April 29, 2016 1:21:57 PM
> > Subject: Re: [Gluster-devel] Possible bug in the communications layer ?
> > 
> > Hi Raghavendra,
> > 
> > yes, the readv response contains xdata. The dict length is 38 (0x26)
> > and, at the moment of failure, rsp.xdata.xdata_len already contains 0x26.
> 
> rsp.xdata.xdata_len having 0x26 even when decoding failed indicates that the
> approach used in socket.c to get the length of xdata is correct. However, I
> cannot find any way for xdata to end up in the payload vector other than
> xdata_len being zero. Just to be doubly sure, I have a patch that prints a
> debug message with xdata_len when decoding fails in socket.c. Can you please
> apply the patch, run the tests, and reply with the results?
> 
> > 
> > Xavi
> > 
> > On 29/04/16 09:10, Raghavendra Gowdappa wrote:
> > >
> > >
> > > - Original Message -
> > >> From: "Raghavendra Gowdappa" 
> > >> To: "Xavier Hernandez" 
> > >> Cc: "Gluster Devel" 
> > >> Sent: Friday, April 29, 2016 12:36:43 PM
> > >> Subject: Re: [Gluster-devel] Possible bug in the communications layer ?
> > >>
> > >>
> > >>
> > >> - Original Message -
> > >>> From: "Raghavendra Gowdappa" 
> > >>> To: "Xavier Hernandez" 
> > >>> Cc: "Jeff Darcy" , "Gluster Devel"
> > >>> 
> > >>> Sent: Friday, April 29, 2016 12:07:59 PM
> > >>> Subject: Re: [Gluster-devel] Possible bug in the communications layer ?
> > >>>
> > >>>
> > >>>
> > >>> - Original Message -
> >  From: "Xavier Hernandez" 
> >  To: "Jeff Darcy" 
> >  Cc: "Gluster Devel" 
> >  Sent: Thursday, April 28, 2016 8:15:36 PM
> >  Subject: Re: [Gluster-devel] Possible bug in the communications layer
> >  ?
> > 
> > 
> > 
> >  Hi Jeff,
> > 
> >  On 28.04.2016 15:20, Jeff Darcy wrote:
> > 
> > 
> > 
> >  This happens with Gluster 3.7.11 accessed through Ganesha and gfapi.
> >  The
> >  volume is a distributed-disperse 4*(4+2). I'm able to reproduce the
> >  problem
> >  easily doing the following test: iozone -t2 -s10g -r1024k -i0 -w
> >  -F/iozone{1..2}.dat echo 3 >/proc/sys/vm/drop_caches iozone -t2 -s10g
> >  -r1024k -i1 -w -F/iozone{1..2}.dat The error happens soon after
> >  starting
> >  the
> >  read test. As can be seen in the data below, client3_3_readv_cbk() is
> >  processing an iovec of 116 bytes, however it should be of 154 bytes
> >  (the
> >  buffer in memory really seems to contain 154 bytes). The data on the
> >  network
> >  seems ok (at least I haven't been able to identify any problem), so
> >  this
> >  must be a processing error on the client side. The last field in cut
> >  buffer
> >  of the sequentialized data corresponds to the length of the xdata
> >  field:
> >  0x26. So at least 38 more byte should be present.
> >  Nice detective work, Xavi.  It would be *very* interesting to see what
> >  the value of the "count" parameter is (it's unfortunately optimized
> >  out).
> >  I'll bet it's two, and iov[1].iov_len is 38.  I have a weak memory of
> >  some problems with how this iov is put together, a couple of years
> >  ago,
> >  and it looks like you might have tripped over one more.
> >  It seems you are right. The count is 2 and the first 38 bytes of the
> >  second
> >  vector contains the remaining data of xdata field.
> > >>>
> > >>> This is the bug. client3_3_readv_cbk (and for that matter all the
> > >>> actors/cbks) expects a response in at most two vectors:
> > >>> 1. Program header containing request or response. This is subjected to
> > >>> decoding/encoding. This vector should point to a buffer that contains
> > >>> the
> > >>> entire program header/response contiguously.
> > >>> 2. If the procedure returns payload (like readv response or a write
> > >>> request),
> > >>> second vector contains the buffer pointing to the entire (contiguous)
> > >>> payload. Note that this payload is raw and is not subjected to
> > >>> encoding/decoding.
> > >>>
> > >>> In your case, this _clean_ separation is broken with part of program
> > >>> header
> > >>> slipping into the 2nd vector supposed to contain read data (may be because
> > >>> 

Re: [Gluster-devel] How use Gluster/NFS

2016-04-29 Thread Kaleb S. KEITHLEY
On 04/29/2016 07:34 AM, Rick Macklem wrote:
> Abhishek Paliwal wrote:
>> Hi Team,
>>
>> I want to use gluster NFS and export this gluster volume using 'mount -t nfs
>> -o acl' command.
>>
>> I have done the following changes:
>> 1. Enable the NFS using nfs.disable off
>> 2. Enable the ACL using nfs.acl on
>> 3. RPCbind is also running
>> 4. Kernel NFS is stopped
>>
> You could try setting
>  nfs.register-with-portmap on
> I thought it was enabled by default, but maybe that changed
> when the default for nfs.disable changed?

The default for nfs.disable is _only_ changing starting with GlusterFS-3.8.

GlusterFS-3.8 HASN'T BEEN RELEASED YET.

IOW the default for nfs.disable has _not_ changed in GlusterFS-3.7 and
nfs.register-with-portmap _remains_ enabled by default; and will remain
enabled by default even in GlusterFS-3.8.


-- 

Kaleb
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel


Re: [Gluster-devel] Possible bug in the communications layer ?

2016-04-29 Thread Raghavendra Gowdappa


- Original Message -
> From: "Xavier Hernandez" 
> To: "Raghavendra Gowdappa" 
> Cc: "Gluster Devel" 
> Sent: Friday, April 29, 2016 1:21:57 PM
> Subject: Re: [Gluster-devel] Possible bug in the communications layer ?
> 
> Hi Raghavendra,
> 
> yes, the readv response contains xdata. The dict length is 38 (0x26)
> and, at the moment of failure, rsp.xdata.xdata_len already contains 0x26.

rsp.xdata.xdata_len having 0x26 even when decoding failed indicates that the
approach used in socket.c to get the length of xdata is correct. However, I
cannot find any way for xdata to end up in the payload vector other than
xdata_len being zero. Just to be doubly sure, I have a patch that prints a
debug message with xdata_len when decoding fails in socket.c. Can you please
apply the patch, run the tests, and reply with the results?

> 
> Xavi
> 
> On 29/04/16 09:10, Raghavendra Gowdappa wrote:
> >
> >
> > - Original Message -
> >> From: "Raghavendra Gowdappa" 
> >> To: "Xavier Hernandez" 
> >> Cc: "Gluster Devel" 
> >> Sent: Friday, April 29, 2016 12:36:43 PM
> >> Subject: Re: [Gluster-devel] Possible bug in the communications layer ?
> >>
> >>
> >>
> >> - Original Message -
> >>> From: "Raghavendra Gowdappa" 
> >>> To: "Xavier Hernandez" 
> >>> Cc: "Jeff Darcy" , "Gluster Devel"
> >>> 
> >>> Sent: Friday, April 29, 2016 12:07:59 PM
> >>> Subject: Re: [Gluster-devel] Possible bug in the communications layer ?
> >>>
> >>>
> >>>
> >>> - Original Message -
>  From: "Xavier Hernandez" 
>  To: "Jeff Darcy" 
>  Cc: "Gluster Devel" 
>  Sent: Thursday, April 28, 2016 8:15:36 PM
>  Subject: Re: [Gluster-devel] Possible bug in the communications layer ?
> 
> 
> 
>  Hi Jeff,
> 
>  On 28.04.2016 15:20, Jeff Darcy wrote:
> 
> 
> 
>  This happens with Gluster 3.7.11 accessed through Ganesha and gfapi. The
>  volume is a distributed-disperse 4*(4+2). I'm able to reproduce the
>  problem
>  easily doing the following test: iozone -t2 -s10g -r1024k -i0 -w
>  -F/iozone{1..2}.dat echo 3 >/proc/sys/vm/drop_caches iozone -t2 -s10g
>  -r1024k -i1 -w -F/iozone{1..2}.dat The error happens soon after starting
>  the
>  read test. As can be seen in the data below, client3_3_readv_cbk() is
>  processing an iovec of 116 bytes, however it should be of 154 bytes (the
>  buffer in memory really seems to contain 154 bytes). The data on the
>  network
>  seems ok (at least I haven't been able to identify any problem), so this
>  must be a processing error on the client side. The last field in cut
>  buffer
>  of the sequentialized data corresponds to the length of the xdata field:
>  0x26. So at least 38 more byte should be present.
>  Nice detective work, Xavi.  It would be *very* interesting to see what
>  the value of the "count" parameter is (it's unfortunately optimized
>  out).
>  I'll bet it's two, and iov[1].iov_len is 38.  I have a weak memory of
>  some problems with how this iov is put together, a couple of years ago,
>  and it looks like you might have tripped over one more.
>  It seems you are right. The count is 2 and the first 38 bytes of the
>  second
>  vector contains the remaining data of xdata field.
> >>>
> >>> This is the bug. client3_3_readv_cbk (and for that matter all the
> >>> actors/cbks) expects a response in at most two vectors:
> >>> 1. Program header containing request or response. This is subjected to
> >>> decoding/encoding. This vector should point to a buffer that contains the
> >>> entire program header/response contiguously.
> >>> 2. If the procedure returns payload (like readv response or a write
> >>> request),
> >>> second vector contains the buffer pointing to the entire (contiguous)
> >>> payload. Note that this payload is raw and is not subjected to
> >>> encoding/decoding.
> >>>
> >>> In your case, this _clean_ separation is broken with part of program
> >>> header
> >>> slipping into the 2nd vector supposed to contain read data (maybe because of
> >>> rpc fragmentation). I think this is a bug in the socket layer. I'll update
> >>> more
> >>> on this.
> >>
> >> Does your read response include xdata too? I think the code related to
> >> reading xdata in readv response is a bit murky.
> >>
> >> 
> >>
> >> case SP_STATE_ACCEPTED_SUCCESS_REPLY_INIT:
> >> default_read_size = xdr_sizeof ((xdrproc_t)
> >> xdr_gfs3_read_rsp,
> >> &read_rsp);
> >>
> >> proghdr_buf = frag->fragcurrent;
> >>
> >> __socket_proto_init_pending (priv, 

Re: [Gluster-devel] How use Gluster/NFS

2016-04-29 Thread Rick Macklem
Abhishek Paliwal wrote:
> Hi Team,
> 
> I want to use gluster NFS and export this gluster volume using 'mount -t nfs
> -o acl' command.
> 
> I have done the following changes:
> 1. Enable the NFS using nfs.disable off
> 2. Enable the ACL using nfs.acl on
> 3. RPCbind is also running
> 4. Kernel NFS is stopped
> 
You could try setting
 nfs.register-with-portmap on
I thought it was enabled by default, but maybe that changed
when the default for nfs.disable changed?

Good luck with it, rick

> But still getting follows errors:
> 
> mount.nfs: mount(2): Connection refused
> mount.nfs: portmap query retrying: RPC: Program not registered
> 
> mount.nfs: portmap query failed: RPC: Program not registered
> 
> mount.nfs: requested NFS version or transport protocol is not supported
> mount.nfs: timeout set for Fri Apr 29 06:13:25 2016
> mount.nfs: trying text-based options
> 'acl,vers=4,addr=10.32.0.48,clientaddr=10.32.0.48'
> mount.nfs: trying text-based options 'acl,addr=10.32.0.48'
> mount.nfs: prog 13, trying vers=3, prot=6
> mount.nfs: prog 13, trying vers=3, prot=17
> 
> After execute the mount command as follows:
> 
> mount -v -t nfs -o acl 10.32.0.48:/opt/lvmdir/c2/brick /tmp/p
> 
> #rpcinfo -p output
> 
> # rpcinfo -p
> program vers proto port service
> 10 4 tcp 111 portmapper
> 10 3 tcp 111 portmapper
> 10 2 tcp 111 portmapper
> 10 4 udp 111 portmapper
> 10 3 udp 111 portmapper
> 10 2 udp 111 portmapper
> 100024 1 udp 53564 status
> 100024 1 tcp 60246 status
> 
> 
> Showing no open port for Gluster/NFS, so please tell me the steps to enable
> Gluster/NFS.
> 
> 
> --
> 
> 
> 
> 
> Regards
> Abhishek Paliwal
> 
> ___
> Gluster-devel mailing list
> Gluster-devel@gluster.org
> http://www.gluster.org/mailman/listinfo/gluster-devel
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel


Re: [Gluster-devel] Regression-test-burn-in crash in EC test

2016-04-29 Thread Jeff Darcy
> The test is doing renames where source and target directories are
> different. At the same time a new ec-set is added and rebalance started.
> Rebalance will cause dht to also move files between bricks. Maybe this
> is causing some race in dht ?
> 
> I'll try to continue investigating when I have some time.

That would be great, but if you've pursued this as far as DHT then it
would be OK to hand it off to that team as well.  Thanks!
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel


[Gluster-devel] Geo-replication in Tiering based volume - Review request

2016-04-29 Thread Saravanakumar Arumugam

Hi,

Please provide your comments / feedback about handling Geo-replication 
in a Tiering based volume.


link: 
https://github.com/gluster/glusterfs-specs/blob/master/under_review/Tiering_georeplication.md



The following patches fix the issue (already merged):

http://review.gluster.org/#/c/12326

http://review.gluster.org/#/c/12355

http://review.gluster.org/#/c/12239

http://review.gluster.org/#/c/12417

http://review.gluster.org/#/c/12844

http://review.gluster.org/#/c/13281


Thanks,
Saravana

___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel


[Gluster-devel] Requesting lock-migration reviews

2016-04-29 Thread Susant Palai
Hi All,
  The following patches need reviews for the lock-migration feature. They are
targeted for 3.8. Requesting reviews.

1- http://review.gluster.org/#/c/13970/
2- http://review.gluster.org/#/c/13993/
3- http://review.gluster.org/#/c/13994/
4- http://review.gluster.org/#/c/13995/
5- http://review.gluster.org/#/c/14011/
6- http://review.gluster.org/#/c/14012/
7- http://review.gluster.org/#/c/14013/
8- http://review.gluster.org/#/c/14014/
9- http://review.gluster.org/#/c/14024/
10- http://review.gluster.org/#/c/13493/
11- http://review.gluster.org/#/c/14074/

Thanks,
Susant
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel


[Gluster-devel] Bugs with incorrect status

2016-04-29 Thread ndevos
1008839 (mainline) POST: Certain blocked entry lock info not retained after the 
lock is granted
  [master] Ie37837 features/locks : Certain blocked entry lock info not 
retained after the lock is granted (ABANDONED)
  ** ata...@redhat.com: Bug 1008839 is in POST, but all changes have been 
abandoned **

1062437 (mainline) POST: stripe does not work with empty xlator
  [master] I778699 stripe: fix FIRST_CHILD checks to be more general (ABANDONED)
  ** jda...@redhat.com: Bug 1062437 is in POST, but all changes have been 
abandoned **

1074947 (mainline) ON_QA: add option to bulld rpm without server
  [master] Iaa1498 build: add option to bulld rpm without server (NEW)
  ** nde...@redhat.com: Bug 1074947 should be in POST, change Iaa1498 under 
review **

1089642 (mainline) POST: Quotad doesn't load io-stats xlator, which implies 
none of the logging options have any effect on it.
  [master] Iccc033 glusterd: add io-stats to all quotad's sub-graphs (ABANDONED)
  ** spa...@redhat.com: Bug 1089642 is in POST, but all changes have been 
abandoned **

1092414 (mainline) POST: Disable NFS by default
  [master] Ibdf990 glusterd: default value of nfs.disable, change from false to 
true (MERGED)
  [master] If52f5e glusterd: default value of nfs.disable, change from false to 
true (MERGED)
  ** nde...@redhat.com: Bug 1092414 should be MODIFIED, change If52f5e has been 
merged **

1093768 (3.5.0) POST: Comment typo in gf-history.changelog.c
  ** kschi...@redhat.com: No change posted, but bug 1093768 is in POST **

1094478 (3.5.0) POST: Bad macro in changelog-misc.h
  ** kschi...@redhat.com: No change posted, but bug 1094478 is in POST **

1099294 (3.5.0) POST: Incorrect error message in 
/features/changelog/lib/src/gf-history-changelog.c
  ** kschi...@redhat.com: No change posted, but bug 1099294 is in POST **

1099460 (3.5.0) NEW: file locks are not released within an acceptable time when 
a fuse-client uncleanly disconnects
  [release-3.5] I5e5f54 socket: use TCP_USER_TIMEOUT to detect client failures 
quicker (NEW)
  ** nde...@redhat.com: Bug 1099460 should be in POST, change I5e5f54 under 
review **

1099683 (3.5.0) POST: Silent error from call to realpath in 
features/changelog/lib/src/gf-history-changelog.c
  ** vshan...@redhat.com: No change posted, but bug 1099683 is in POST **

110 (mainline) MODIFIED: [RFE] Add regression tests for the component 
geo-replication
  [master] Ie27848 tests/geo-rep: Automated configuration for geo-rep 
regression. (NEW)
  [master] I9c9ae8 geo-rep: Regression tests improvements (ABANDONED)
  [master] I433dd8 Geo-rep: Adding regression tests for geo-rep (MERGED)
  [master] Ife8201 Geo-rep: Adding regression tests for geo-rep (ABANDONED)
  ** khire...@redhat.com: Bug 110 should be in POST, change Ie27848 under 
review **

020 (3.5.0) POST: Unused code changelog_entry_length
  ** kschi...@redhat.com: No change posted, but bug 020 is in POST **

031 (3.5.0) POST: CHANGELOG_FILL_HTIME_DIR macro fills buffer without size 
limits
  ** kschi...@redhat.com: No change posted, but bug 031 is in POST **

1114415 (mainline) MODIFIED: There is no way to monitor if the healing is 
successful when the brick is erased
  ** pkara...@redhat.com: No change posted, but bug 1114415 is in MODIFIED **

1116714 (3.5.0) POST: indices/xattrop directory contains stale entries
  [release-3.5] I470cf8 afr : Added xdata flags to indicate probable existence 
of stale index. (ABANDONED)
  ** ata...@redhat.com: Bug 1116714 is in POST, but all changes have been 
abandoned **

1117886 (mainline) MODIFIED: Gluster not resolving hosts with IPv6 only lookups
  [master] Icbaa3c Reinstate ipv6 support (NEW)
  [master] Iebc96e glusterd: Bug fixes for IPv6 support (MERGED)
  ** nithind1...@yahoo.in: Bug 1117886 should be in POST, change Icbaa3c under 
review **

1122120 (3.5.1) MODIFIED: Bricks crashing after disable and re-enabled quota on 
a volume
  ** kdhan...@redhat.com: No change posted, but bug 1122120 is in MODIFIED **

1131846 (mainline) POST: remove-brick - once you stop remove-brick using stop 
command,  status says '  failed: remove-brick not started.'
  ** gg...@redhat.com: No change posted, but bug 1131846 is in POST **

1132074 (mainline) POST: Document steps to perform for replace-brick
  [master] Ic7292b doc: Steps for Replacing brick in gluster volume (ABANDONED)
  ** pkara...@redhat.com: Bug 1132074 is in POST, but all changes have been 
abandoned **

1134305 (mainline) POST: rpc actor failed to complete successfully messages in 
Glusterd
  [master] I094516 protocol/client: Add explicit states for connection sequence 
(ABANDONED)
  ** pkara...@redhat.com: Bug 1134305 is in POST, but all changes have been 
abandoned **

1142423 (mainline) MODIFIED: [DHT-REBALANCE]-DataLoss: The data appended to a 
file during its migration will be lost once the migration is done
  [master] I044a83 cluster/dht Use additional dst_info in inode_ctx (NEW)
  [master] I5c8810 cluster/dht: Fix stale 

Re: [Gluster-devel] Regression-test-burn-in crash in EC test

2016-04-29 Thread Xavier Hernandez

Hi Jeff,

On 27/04/16 20:01, Jeff Darcy wrote:

One of the "rewards" of reviewing and merging people's patches is getting email 
if the next regression-test-burn-in should fail - even if it fails for a completely 
unrelated reason.  Today I got one that's not among the usual suspects.  The failure was 
a core dump in tests/bugs/disperse/bug-1304988.t, weighing in at a respectable 42 frames.

#0  0x7fef25976cb9 in dht_rename_lock_cbk
#1  0x7fef25955f62 in dht_inodelk_done
#2  0x7fef25957352 in dht_blocking_inodelk_cbk
#3  0x7fef32e02f8f in default_inodelk_cbk
#4  0x7fef25c029a3 in ec_manager_inodelk
#5  0x7fef25bf9802 in __ec_manager
#6  0x7fef25bf990c in ec_manager
#7  0x7fef25c03038 in ec_inodelk
#8  0x7fef25bee7ad in ec_gf_inodelk
#9  0x7fef25957758 in dht_blocking_inodelk_rec
#10 0x7fef25957b2d in dht_blocking_inodelk
#11 0x7fef2597713f in dht_rename_lock
#12 0x7fef25977835 in dht_rename
#13 0x7fef32e0f032 in default_rename
#14 0x7fef32e0f032 in default_rename
#15 0x7fef32e0f032 in default_rename
#16 0x7fef32e0f032 in default_rename
#17 0x7fef32e0f032 in default_rename
#18 0x7fef32e07c29 in default_rename_resume
#19 0x7fef32d8ed40 in call_resume_wind
#20 0x7fef32d98b2f in call_resume
#21 0x7fef24cfc568 in open_and_resume
#22 0x7fef24cffb99 in ob_rename
#23 0x7fef24aee482 in mdc_rename
#24 0x7fef248d68e5 in io_stats_rename
#25 0x7fef32e0f032 in default_rename
#26 0x7fef2ab1b2b9 in fuse_rename_resume
#27 0x7fef2ab12c47 in fuse_fop_resume
#28 0x7fef2ab107cc in fuse_resolve_done
#29 0x7fef2ab108a2 in fuse_resolve_all
#30 0x7fef2ab10900 in fuse_resolve_continue
#31 0x7fef2ab0fb7c in fuse_resolve_parent
#32 0x7fef2ab1077d in fuse_resolve
#33 0x7fef2ab10879 in fuse_resolve_all
#34 0x7fef2ab10900 in fuse_resolve_continue
#35 0x7fef2ab0fb7c in fuse_resolve_parent
#36 0x7fef2ab1077d in fuse_resolve
#37 0x7fef2ab10824 in fuse_resolve_all
#38 0x7fef2ab1093e in fuse_resolve_and_resume
#39 0x7fef2ab1b40e in fuse_rename
#40 0x7fef2ab2a96a in fuse_thread_proc
#41 0x7fef3204daa1 in start_thread

In other words we started at FUSE, went through a bunch of performance 
translators, through DHT to EC, and then crashed on the way back.  It seems a 
little odd that we turn the fop around immediately in EC, and that we have 
default_inodelk_cbk at frame 3.  Could one of the DHT or EC people please take 
a look at it?  Thanks!


The part regarding ec seems ok. This is uncommon, but can happen.
When ec_gf_inodelk() is called, it sends an inodelk request to all its
subvolumes. It may happen that the callbacks of all these requests are
received before ec_gf_inodelk() itself returns. In that case the
callback executes inside the same thread as the caller.


The reason why default_inodelk_cbk() is seen is because ec uses this 
function to report the result back to the caller (instead of calling 
STACK_UNWIND() itself).


This seems to be what has happened here.

The frames returned by ec to upper xlators are the same ones used by them
(the frame in dht_blocking_lock() is the same one that receives
dht_blocking_inodelk_cbk()) and ec doesn't touch them; however, the frame
at 0x7fef1003ca5c is absolutely corrupted.


We can see the call state from the core:

(gdb) f 4
#4  0x7fef25c029a3 in ec_manager_inodelk (fop=0x7fef1000d37c, 
state=5) at 
/home/jenkins/root/workspace/regression-test-burn-in/xlators/cluster/ec/src/ec-locks.c:645

645 fop->cbks.inodelk(fop->req_frame, fop, fop->xl,
(gdb) print fop->answer
$30 = (ec_cbk_data_t *) 0x7fef180094ac
(gdb) print fop->answer->op_ret
$31 = 0
(gdb) print fop->answer->op_errno
$32 = 0
(gdb) print fop->answer->count
$33 = 6
(gdb) print fop->answer->mask
$34 = 63

As we can see there's an actual answer to the request with a success 
result (op_ret == 0 and op_errno == 0) composed of the combination of 
answers from 6 subvolumes (count == 6).


Looking at the dht code I have been unable to see any possible cause either.

The test is doing renames where source and target directories are 
different. At the same time a new ec-set is added and rebalance started. 
Rebalance will cause dht to also move files between bricks. Maybe this 
is causing some race in dht?


I'll try to continue investigating when I have some time.

Xavi




https://build.gluster.org/job/regression-test-burn-in/868/console
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel


___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel


Re: [Gluster-devel] Possible bug in the communications layer ?

2016-04-29 Thread Xavier Hernandez

Hi Raghavendra,

yes, the readv response contains xdata. The dict length is 38 (0x26) 
and, at the moment of failure, rsp.xdata.xdata_len already contains 0x26.


Xavi

On 29/04/16 09:10, Raghavendra Gowdappa wrote:



- Original Message -

From: "Raghavendra Gowdappa" 
To: "Xavier Hernandez" 
Cc: "Gluster Devel" 
Sent: Friday, April 29, 2016 12:36:43 PM
Subject: Re: [Gluster-devel] Possible bug in the communications layer ?



- Original Message -

From: "Raghavendra Gowdappa" 
To: "Xavier Hernandez" 
Cc: "Jeff Darcy" , "Gluster Devel"

Sent: Friday, April 29, 2016 12:07:59 PM
Subject: Re: [Gluster-devel] Possible bug in the communications layer ?



- Original Message -

From: "Xavier Hernandez" 
To: "Jeff Darcy" 
Cc: "Gluster Devel" 
Sent: Thursday, April 28, 2016 8:15:36 PM
Subject: Re: [Gluster-devel] Possible bug in the communications layer ?



Hi Jeff,

On 28.04.2016 15:20, Jeff Darcy wrote:



This happens with Gluster 3.7.11 accessed through Ganesha and gfapi. The
volume is a distributed-disperse 4*(4+2). I'm able to reproduce the
problem
easily doing the following test: iozone -t2 -s10g -r1024k -i0 -w
-F/iozone{1..2}.dat echo 3 >/proc/sys/vm/drop_caches iozone -t2 -s10g
-r1024k -i1 -w -F/iozone{1..2}.dat The error happens soon after starting
the
read test. As can be seen in the data below, client3_3_readv_cbk() is
processing an iovec of 116 bytes, however it should be of 154 bytes (the
buffer in memory really seems to contain 154 bytes). The data on the
network
seems ok (at least I haven't been able to identify any problem), so this
must be a processing error on the client side. The last field in cut
buffer
of the sequentialized data corresponds to the length of the xdata field:
0x26. So at least 38 more byte should be present.
Nice detective work, Xavi.  It would be *very* interesting to see what
the value of the "count" parameter is (it's unfortunately optimized out).
I'll bet it's two, and iov[1].iov_len is 38.  I have a weak memory of
some problems with how this iov is put together, a couple of years ago,
and it looks like you might have tripped over one more.
It seems you are right. The count is 2 and the first 38 bytes of the
second
vector contains the remaining data of xdata field.


This is the bug. client3_3_readv_cbk (and for that matter all the
actors/cbks) expects a response in at most two vectors:
1. Program header containing request or response. This is subjected to
decoding/encoding. This vector should point to a buffer that contains the
entire program header/response contiguously.
2. If the procedure returns payload (like readv response or a write
request),
second vector contains the buffer pointing to the entire (contiguous)
payload. Note that this payload is raw and is not subjected to
encoding/decoding.

In your case, this _clean_ separation is broken with part of the program
header slipping into the 2nd vector, which is supposed to contain read data
(maybe because of rpc fragmentation). I think this is a bug in the socket
layer. I'll update more on this.


Does your read response include xdata too? I think the code related to
reading xdata in readv response is a bit murky.



case SP_STATE_ACCEPTED_SUCCESS_REPLY_INIT:
default_read_size = xdr_sizeof ((xdrproc_t)
xdr_gfs3_read_rsp,
&read_rsp);

proghdr_buf = frag->fragcurrent;

__socket_proto_init_pending (priv, default_read_size);

frag->call_body.reply.accepted_success_state
= SP_STATE_READING_PROC_HEADER;

/* fall through */

case SP_STATE_READING_PROC_HEADER:
__socket_proto_read (priv, ret);


By this time we've read read response _minus_ the xdata


I meant we have read "readv response header"



/* there can be 'xdata' in read response, figure it out */
xdrmem_create (&xdr, proghdr_buf, default_read_size,
               XDR_DECODE);


We created xdr stream above with "default_read_size" (this doesn't
include xdata)


/* This will fail if there is xdata sent from server, if not,
   well and good, we don't need to worry about  */


what if xdata is present and decoding failed (as length of xdr stream
above - default_read_size - doesn't include xdata)? would we have a
valid value in read_rsp.xdata.xdata_len? This is the part I am
confused about. If read_rsp.xdata.xdata_len is not correct then there
is a possibility that xdata might not be entirely present in the
vector socket passes to higher layers as progheader (with part or
entire xdata spilling over to payload vector).


xdr_gfs3_read_rsp (&xdr, &read_rsp);

 

Re: [Gluster-devel] Requesting for NetBSD setup

2016-04-29 Thread Karthik Subrahmanya


- Original Message -
> From: "Emmanuel Dreyfus" 
> To: "Karthik Subrahmanya" 
> Cc: "gluster-devel" , gluster-in...@gluster.org
> Sent: Friday, April 29, 2016 12:35:24 PM
> Subject: Re: [Gluster-devel] Requesting for NetBSD setup
> 
> On Fri, Apr 29, 2016 at 01:28:53AM -0400, Karthik Subrahmanya wrote:
> > I would like to ask for a NetBSD setup
> 
> nbslave7[4gh] are disabled in Jenkins right now. They are labeled
> "Disconnected by kaushal", but I don't know why. Once it is confirmed
> that they are not already used for testing, you could pick one.
> 
> I still do not know who the password guardian at Red Hat is, though.
> 
Thanks for the advice Emmanuel; I think that is going to take some time.
Can you point me to an alternative way to test it on my system?
I have actually been stuck on this for some time now and I really can't
understand why it's failing.

Thanks,
Karthik Subrahmanya

> --
> Emmanuel Dreyfus
> m...@netbsd.org
> 
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel


Re: [Gluster-devel] [Gluster-infra] Requesting for NetBSD setup

2016-04-29 Thread Kaushal M
On Fri, Apr 29, 2016 at 12:35 PM, Emmanuel Dreyfus  wrote:
> On Fri, Apr 29, 2016 at 01:28:53AM -0400, Karthik Subrahmanya wrote:
>> I would like to ask for a NetBSD setup
>
> nbslave7[4gh] are disabled in Jenkins right now. They are labeled
> "Disconnected by kaushal", but I don't know why. Once it is confirmed
> that they are not already used for testing, you could pick one.
>
> I still do not know who the password guardian at Red Hat is, though.

I often disconnect machines that aren't in a working state, and reboot them.
If I've left something in the disconnected state, most likely those
machines didn't get back to a working state after the reboot.
Or it could be that I just forgot.

>
> --
> Emmanuel Dreyfus
> m...@netbsd.org
> ___
> Gluster-infra mailing list
> gluster-in...@gluster.org
> http://www.gluster.org/mailman/listinfo/gluster-infra
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel


Re: [Gluster-devel] Possible bug in the communications layer ?

2016-04-29 Thread Raghavendra Gowdappa


- Original Message -
> From: "Raghavendra Gowdappa" 
> To: "Xavier Hernandez" 
> Cc: "Gluster Devel" 
> Sent: Friday, April 29, 2016 12:36:43 PM
> Subject: Re: [Gluster-devel] Possible bug in the communications layer ?
> 
> 
> 
> - Original Message -
> > From: "Raghavendra Gowdappa" 
> > To: "Xavier Hernandez" 
> > Cc: "Jeff Darcy" , "Gluster Devel"
> > 
> > Sent: Friday, April 29, 2016 12:07:59 PM
> > Subject: Re: [Gluster-devel] Possible bug in the communications layer ?
> > 
> > 
> > 
> > - Original Message -
> > > From: "Xavier Hernandez" 
> > > To: "Jeff Darcy" 
> > > Cc: "Gluster Devel" 
> > > Sent: Thursday, April 28, 2016 8:15:36 PM
> > > Subject: Re: [Gluster-devel] Possible bug in the communications layer ?
> > > 
> > > 
> > > 
> > > Hi Jeff,
> > > 
> > > On 28.04.2016 15:20, Jeff Darcy wrote:
> > > 
> > > 
> > > 
> > > This happens with Gluster 3.7.11 accessed through Ganesha and gfapi. The
> > > volume is a distributed-disperse 4*(4+2). I'm able to reproduce the
> > > problem
> > > easily doing the following test: iozone -t2 -s10g -r1024k -i0 -w
> > > -F/iozone{1..2}.dat echo 3 >/proc/sys/vm/drop_caches iozone -t2 -s10g
> > > -r1024k -i1 -w -F/iozone{1..2}.dat The error happens soon after starting
> > > the
> > > read test. As can be seen in the data below, client3_3_readv_cbk() is
> > > processing an iovec of 116 bytes, however it should be of 154 bytes (the
> > > buffer in memory really seems to contain 154 bytes). The data on the
> > > network
> > > seems ok (at least I haven't been able to identify any problem), so this
> > > must be a processing error on the client side. The last field in cut
> > > buffer
> > > of the sequentialized data corresponds to the length of the xdata field:
> > > 0x26. So at least 38 more bytes should be present.
> > > Nice detective work, Xavi.  It would be *very* interesting to see what
> > > the value of the "count" parameter is (it's unfortunately optimized out).
> > > I'll bet it's two, and iov[1].iov_len is 38.  I have a weak memory of
> > > some problems with how this iov is put together, a couple of years ago,
> > > and it looks like you might have tripped over one more.
> > > It seems you are right. The count is 2 and the first 38 bytes of the
> > > second vector contain the remaining data of the xdata field.
> > 
> > This is the bug. client3_3_readv_cbk (and for that matter all the
> > actors/cbks) expects the response in at most two vectors:
> > 1. Program header containing request or response. This is subjected to
> > decoding/encoding. This vector should point to a buffer that contains the
> > entire program header/response contiguously.
> > 2. If the procedure returns payload (like readv response or a write
> > request),
> > second vector contains the buffer pointing to the entire (contiguous)
> > payload. Note that this payload is raw and is not subjected to
> > encoding/decoding.
> > 
> > In your case, this _clean_ separation is broken, with part of the program
> > header slipping into the 2nd vector that is supposed to contain read data
> > (maybe because of RPC fragmentation). I think this is a bug in the socket
> > layer. I'll update more on this.
> 
> Does your read response include xdata too? I think the code related to
> reading xdata in readv response is a bit murky.
> 
> 
> 
> case SP_STATE_ACCEPTED_SUCCESS_REPLY_INIT:
> default_read_size = xdr_sizeof ((xdrproc_t) xdr_gfs3_read_rsp,
>                                 &read_rsp);
> 
> proghdr_buf = frag->fragcurrent;
> 
> __socket_proto_init_pending (priv, default_read_size);
> 
> frag->call_body.reply.accepted_success_state
> = SP_STATE_READING_PROC_HEADER;
> 
> /* fall through */
> 
> case SP_STATE_READING_PROC_HEADER:
> __socket_proto_read (priv, ret);
> 
> > By this time we've read the read response _minus_ the xdata

I meant we have read "readv response header"

> 
> /* there can be 'xdata' in read response, figure it out */
> xdrmem_create (&xdr, proghdr_buf, default_read_size,
>                XDR_DECODE);
> 
> >> We created xdr stream above with "default_read_size" (this doesn't
> >> include xdata)
> 
> /* This will fail if there is xdata sent from server, if not,
>well and good, we don't need to worry about  */
> 
> >> what if xdata is present and decoding failed (as length of xdr stream
> >> above - default_read_size - doesn't include xdata)? would we have a
> >> valid value in read_rsp.xdata.xdata_len? This is the part I am
> >> confused about. If read_rsp.xdata.xdata_len is not correct 

[Gluster-devel] Bitrot Review Request

2016-04-29 Thread Kotresh Hiremath Ravishankar
Hi Pranith,

You had a concern about consuming I/O threads when bit-rot uses the rchecksum
interface for signing, normal scrubbing, and on-demand scrubbing with tiering.
 
  http://review.gluster.org/#/c/13833/5/xlators/storage/posix/src/posix.c

As discussed in the comments, the concern is valid, so the above patch is not
being taken in and will be abandoned.

I have the following patch where the signing and normal scrubbing would not
consume io-threads. Only the on-demand scrubbing consumes io-threads. I think
this should be fine, as tiering is single-threaded and only consumes
one I/O thread (as noted by Joseph on Patch Set 6).

  http://review.gluster.org/#/c/13969/

Since on-demand scrubbing is disabled by default, there is a size cap, and
we document that the default number of I/O threads should be increased,
consuming one I/O thread for scrubbing should be fine, I guess.

Let me know your thoughts.

Thanks and Regards,
Kotresh H R

___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel


Re: [Gluster-devel] Possible bug in the communications layer ?

2016-04-29 Thread Raghavendra Gowdappa


- Original Message -
> From: "Raghavendra Gowdappa" 
> To: "Xavier Hernandez" 
> Cc: "Jeff Darcy" , "Gluster Devel" 
> 
> Sent: Friday, April 29, 2016 12:07:59 PM
> Subject: Re: [Gluster-devel] Possible bug in the communications layer ?
> 
> 
> 
> - Original Message -
> > From: "Xavier Hernandez" 
> > To: "Jeff Darcy" 
> > Cc: "Gluster Devel" 
> > Sent: Thursday, April 28, 2016 8:15:36 PM
> > Subject: Re: [Gluster-devel] Possible bug in the communications layer ?
> > 
> > 
> > 
> > Hi Jeff,
> > 
> > On 28.04.2016 15:20, Jeff Darcy wrote:
> > 
> > 
> > 
> > This happens with Gluster 3.7.11 accessed through Ganesha and gfapi. The
> > volume is a distributed-disperse 4*(4+2). I'm able to reproduce the problem
> > easily doing the following test: iozone -t2 -s10g -r1024k -i0 -w
> > -F/iozone{1..2}.dat echo 3 >/proc/sys/vm/drop_caches iozone -t2 -s10g
> > -r1024k -i1 -w -F/iozone{1..2}.dat The error happens soon after starting
> > the
> > read test. As can be seen in the data below, client3_3_readv_cbk() is
> > processing an iovec of 116 bytes, however it should be of 154 bytes (the
> > buffer in memory really seems to contain 154 bytes). The data on the
> > network
> > seems ok (at least I haven't been able to identify any problem), so this
> > must be a processing error on the client side. The last field in cut buffer
> > of the sequentialized data corresponds to the length of the xdata field:
> > 0x26. So at least 38 more bytes should be present.
> > Nice detective work, Xavi.  It would be *very* interesting to see what
> > the value of the "count" parameter is (it's unfortunately optimized out).
> > I'll bet it's two, and iov[1].iov_len is 38.  I have a weak memory of
> > some problems with how this iov is put together, a couple of years ago,
> > and it looks like you might have tripped over one more.
> > It seems you are right. The count is 2 and the first 38 bytes of the second
> > vector contain the remaining data of the xdata field.
> 
> This is the bug. client3_3_readv_cbk (and for that matter all the
> actors/cbks) expects the response in at most two vectors:
> 1. Program header containing request or response. This is subjected to
> decoding/encoding. This vector should point to a buffer that contains the
> entire program header/response contiguously.
> 2. If the procedure returns payload (like readv response or a write request),
> second vector contains the buffer pointing to the entire (contiguous)
> payload. Note that this payload is raw and is not subjected to
> encoding/decoding.
> 
> In your case, this _clean_ separation is broken, with part of the program
> header slipping into the 2nd vector that is supposed to contain read data
> (maybe because of RPC fragmentation). I think this is a bug in the socket
> layer. I'll update more on this.

Does your read response include xdata too? I think the code related to reading 
xdata in readv response is a bit murky.



case SP_STATE_ACCEPTED_SUCCESS_REPLY_INIT:
default_read_size = xdr_sizeof ((xdrproc_t) xdr_gfs3_read_rsp,
                                &read_rsp);

proghdr_buf = frag->fragcurrent;

__socket_proto_init_pending (priv, default_read_size);

frag->call_body.reply.accepted_success_state
= SP_STATE_READING_PROC_HEADER;

/* fall through */

case SP_STATE_READING_PROC_HEADER:
__socket_proto_read (priv, ret);

> By this time we've read the read response _minus_ the xdata

/* there can be 'xdata' in read response, figure it out */
xdrmem_create (&xdr, proghdr_buf, default_read_size,
               XDR_DECODE);

>> We created xdr stream above with "default_read_size" (this doesn't 
>> include xdata)

/* This will fail if there is xdata sent from server, if not,
   well and good, we don't need to worry about  */

>> what if xdata is present and decoding failed (as length of xdr stream 
>> above - default_read_size - doesn't include xdata)? would we have a 
>> valid value in read_rsp.xdata.xdata_len? This is the part I am confused 
>> about. If read_rsp.xdata.xdata_len is not correct then there is a 
>> possibility that xdata might not be entirely present in the vector 
>> socket passes to higher layers as progheader (with part or entire xdata 
>> spilling over to payload vector).

xdr_gfs3_read_rsp (&xdr, &read_rsp);

free (read_rsp.xdata.xdata_val);

/* need to round off to proper roof (%4), as XDR packing pads   


   

Re: [Gluster-devel] Requesting for NetBSD setup

2016-04-29 Thread Emmanuel Dreyfus
On Fri, Apr 29, 2016 at 01:28:53AM -0400, Karthik Subrahmanya wrote:
> I would like to ask for a NetBSD setup

nbslave7[4gh] are disabled in Jenkins right now. They are labeled
"Disconnected by kaushal", but I don't know why. Once it is confirmed
that they are not already used for testing, you could pick one.

I still do not know who the password guardian at Red Hat is, though.

-- 
Emmanuel Dreyfus
m...@netbsd.org
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel


Re: [Gluster-devel] Possible bug in the communications layer ?

2016-04-29 Thread Raghavendra Gowdappa


- Original Message -
> From: "Xavier Hernandez" 
> To: "Jeff Darcy" 
> Cc: "Gluster Devel" 
> Sent: Thursday, April 28, 2016 8:15:36 PM
> Subject: Re: [Gluster-devel] Possible bug in the communications layer ?
> 
> 
> 
> Hi Jeff,
> 
> On 28.04.2016 15:20, Jeff Darcy wrote:
> 
> 
> 
> This happens with Gluster 3.7.11 accessed through Ganesha and gfapi. The
> volume is a distributed-disperse 4*(4+2). I'm able to reproduce the problem
> easily with the following test:
>
>   iozone -t2 -s10g -r1024k -i0 -w -F /iozone{1..2}.dat
>   echo 3 >/proc/sys/vm/drop_caches
>   iozone -t2 -s10g -r1024k -i1 -w -F /iozone{1..2}.dat
>
> The error happens soon after starting the read test. As can be seen in the
> data below, client3_3_readv_cbk() is processing an iovec of 116 bytes,
> however it should be of 154 bytes (the buffer in memory really seems to
> contain 154 bytes). The data on the network seems ok (at least I haven't
> been able to identify any problem), so this must be a processing error on
> the client side. The last field in the cut buffer of the serialized data
> corresponds to the length of the xdata field: 0x26. So at least 38 more
> bytes should be present.
> Nice detective work, Xavi.  It would be *very* interesting to see what
> the value of the "count" parameter is (it's unfortunately optimized out).
> I'll bet it's two, and iov[1].iov_len is 38.  I have a weak memory of
> some problems with how this iov is put together, a couple of years ago,
> and it looks like you might have tripped over one more.
> It seems you are right. The count is 2 and the first 38 bytes of the second
> vector contain the remaining data of the xdata field.

This is the bug. client3_3_readv_cbk (and for that matter all the actors/cbks)
expects the response in at most two vectors:
1. Program header containing request or response. This is subjected to 
decoding/encoding. This vector should point to a buffer that contains the 
entire program header/response contiguously.
2. If the procedure returns payload (like readv response or a write request), 
second vector contains the buffer pointing to the entire (contiguous) payload. 
Note that this payload is raw and is not subjected to encoding/decoding.

In your case, this _clean_ separation is broken, with part of the program header
slipping into the 2nd vector that is supposed to contain read data (maybe because
of RPC fragmentation). I think this is a bug in the socket layer. I'll update
more on this.

> The rest of the data in
> the second vector seems to be the payload of the readv fop, plus 2 bytes of
> padding:
> (gdb) f 0
> #0  client3_3_readv_cbk (req=0x7fdc4051a31c, iov=0x7fdc4051a35c,
> count=<optimized out>, myframe=0x7fdc520d505c) at client-rpc-fops.c:3021
> 3021gf_msg (this->name, GF_LOG_ERROR, EINVAL,
> (gdb) print *iov
> $2 = {iov_base = 0x7fdc14b0d018, iov_len = 116}
> (gdb) f 1
> #1  0x7fdc56dafab0 in rpc_clnt_handle_reply
> (clnt=clnt@entry=0x7fdc3c1f4bb0, pollin=pollin@entry=0x7fdc34010f20) at
> rpc-clnt.c:764
> 764 req->cbkfn (req, req->rsp, req->rspcnt, saved_frame->frame);
> (gdb) print *pollin
> $3 = {vector = {{iov_base = 0x7fdc14b0d000, iov_len = 140}, {iov_base =
> 0x7fdc14a4d000, iov_len = 32808}, {iov_base = 0x0, iov_len = 0} <repeats ...
> times>}, count = 2,
> vectored = 1 '\001', private = 0x7fdc340106c0, iobref = 0x7fdc34006660,
> hdr_iobuf = 0x7fdc3c4c07c0, is_reply = 1 '\001'}
> (gdb) f 0
> #0  client3_3_readv_cbk (req=0x7fdc4051a31c, iov=0x7fdc4051a35c,
> count=<optimized out>, myframe=0x7fdc520d505c) at client-rpc-fops.c:3021
> 3021gf_msg (this->name, GF_LOG_ERROR, EINVAL,
> (gdb) print iov[1]
> $4 = {iov_base = 0x7fdc14a4d000, iov_len = 32808}
> (gdb) print iov[2]
> $5 = {iov_base = 0x2, iov_len = 140583741974112}
> (gdb) x/128xb 0x7fdc14a4d000
> 0x7fdc14a4d000: 0x00 0x00 0x00 0x01 0x00 0x00 0x00 0x17
> 0x7fdc14a4d008: 0x00 0x00 0x00 0x02 0x67 0x6c 0x75 0x73
> 0x7fdc14a4d010: 0x74 0x65 0x72 0x66 0x73 0x2e 0x69 0x6e
> 0x7fdc14a4d018: 0x6f 0x64 0x65 0x6c 0x6b 0x2d 0x63 0x6f
> 0x7fdc14a4d020: 0x75 0x6e 0x74 0x00 0x31 0x00 0x00 0x00
> 0x7fdc14a4d028: 0x5c 0x5c 0x5c 0x5c 0x5c 0x5c 0x5c 0x5c
> 0x7fdc14a4d030: 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00
> 0x7fdc14a4d038: 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00
> 0x7fdc14a4d040: 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00
> 0x7fdc14a4d048: 0x5c 0x00 0x00 0x00 0x00 0x00 0x00 0x00
> 0x7fdc14a4d050: 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00
> 0x7fdc14a4d058: 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00
> 0x7fdc14a4d060: 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00
> 0x7fdc14a4d068: 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00
> 0x7fdc14a4d070: 0x00 0x00 0x00 0x00 0x00 0x00 0x00

[Gluster-devel] How use Gluster/NFS

2016-04-29 Thread ABHISHEK PALIWAL
Hi  Team,

I want to use Gluster NFS and mount the exported gluster volume using the
'mount -t nfs -o acl' command.

I have made the following changes:
1. Enabled NFS using nfs.disable off
2. Enabled ACLs using nfs.acl on
3. rpcbind is also running
4. Kernel NFS is stopped

But I am still getting the following errors:

mount.nfs: mount(2): Connection refused
mount.nfs: portmap query retrying: RPC: Program not registered

mount.nfs: portmap query failed: RPC: Program not registered

mount.nfs: requested NFS version or transport protocol is not supported
mount.nfs: timeout set for Fri Apr 29 06:13:25 2016
mount.nfs: trying text-based options
'acl,vers=4,addr=10.32.0.48,clientaddr=10.32.0.48'
mount.nfs: trying text-based options 'acl,addr=10.32.0.48'
mount.nfs: prog 13, trying vers=3, prot=6
mount.nfs: prog 13, trying vers=3, prot=17

These errors appear after executing the mount command as follows:

mount -v -t nfs -o acl 10.32.0.48:/opt/lvmdir/c2/brick /tmp/p

#rpcinfo -p output

# rpcinfo -p
   program vers proto   port  service
    100000    4   tcp    111  portmapper
    100000    3   tcp    111  portmapper
    100000    2   tcp    111  portmapper
    100000    4   udp    111  portmapper
    100000    3   udp    111  portmapper
    100000    2   udp    111  portmapper
    100024    1   udp  53564  status
    100024    1   tcp  60246  status


This shows no open port for Gluster/NFS, so please tell me the steps to
enable Gluster/NFS.
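For reference, a hedged sketch of the usual steps to bring up the legacy Gluster/NFS server on a 3.x volume (assuming a volume named "myvol"; substitute your own volume name — this is not a transcript from the reporter's system):

```shell
# Enable the gluster NFS server and ACL support on the volume.
gluster volume set myvol nfs.disable off
gluster volume set myvol nfs.acl on

# Restart the volume so glusterd spawns the NFS server process.
gluster volume stop myvol
gluster volume start myvol

# Verify registration: rpcinfo should now list nfs (100003),
# mountd (100005) and nlockmgr (100021) besides portmapper.
rpcinfo -p
showmount -e localhost
```

If rpcinfo still shows only portmapper and status, the gluster NFS process likely failed to start or to register with rpcbind; /var/log/glusterfs/nfs.log usually says why (one common cause is rpcbind starting after glusterd).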


-- 
Regards
Abhishek Paliwal
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel