Re: [Gluster-users] 3.8.3 Shards Healing Glacier Slow

2016-08-29 Thread Krutika Dhananjay
Ignore. I just realised you're on 3.7.14, so then the problem may not be
with the granular entry self-heal feature.

-Krutika

On Tue, Aug 30, 2016 at 10:14 AM, Krutika Dhananjay 
wrote:

> OK. Do you also have granular-entry-heal on - just so that I can isolate
> the problem area.
>
> -Krutika
>
> On Tue, Aug 30, 2016 at 9:55 AM, Darrell Budic 
> wrote:
>
>> I noticed that my new brick (replacement disk) did not have a .shard
>> directory created on the brick, if that helps.
>>
>> I removed the affected brick from the volume and then wiped the disk, did
>> an add-brick, and everything healed right up. I didn’t try and set any
>> attrs or anything else, just removed and added the brick as new.
>>
>> On Aug 29, 2016, at 9:49 AM, Darrell Budic 
>> wrote:
>>
>> Just to let you know I’m seeing the same issue under 3.7.14 on CentOS 7.
>> Some content was healed correctly, now all the shards are queued up in a
>> heal list, but nothing is healing. Got similar brick errors logged to the
>> ones David was getting on the brick that isn’t healing:
>>
>> [2016-08-29 03:31:40.436110] E [MSGID: 115050]
>> [server-rpc-fops.c:179:server_lookup_cbk] 0-gv0-rep-server: 1613822:
>> LOOKUP (null) (----
>> /0f61bf63-8ef1-4e53-8bc3-6d46590c4fb1.29) ==> (Invalid argument)
>> [Invalid argument]
>> [2016-08-29 03:31:43.005013] E [MSGID: 115050]
>> [server-rpc-fops.c:179:server_lookup_cbk] 0-gv0-rep-server: 1616802:
>> LOOKUP (null) (----
>> /0f61bf63-8ef1-4e53-8bc3-6d46590c4fb1.40) ==> (Invalid argument)
>> [Invalid argument]
>>
>> This was after replacing the drive the brick was on and trying to get it
>> back into the system by setting the volume's fattr on the brick dir. I’ll
>> try the suggested method here on it shortly.
>>
>>   -Darrell
>>
>>
>> On Aug 29, 2016, at 7:25 AM, Krutika Dhananjay 
>> wrote:
>>
>> Got it. Thanks.
>>
>> I tried the same test and shd crashed with SIGABRT (well, that's because
>> I compiled from src with -DDEBUG).
>> In any case, this error would prevent full heal from proceeding further.
>> I'm debugging the crash now. Will let you know when I have the RC.
>>
>> -Krutika
>>
>> On Mon, Aug 29, 2016 at 5:47 PM, David Gossage <
>> dgoss...@carouselchecks.com> wrote:
>>
>>>
>>> On Mon, Aug 29, 2016 at 7:14 AM, David Gossage <
>>> dgoss...@carouselchecks.com> wrote:
>>>
 On Mon, Aug 29, 2016 at 5:25 AM, Krutika Dhananjay  wrote:

> Could you attach both client and brick logs? Meanwhile I will try
> these steps out on my machines and see if it is easily recreatable.
>
>
 Hoping 7z files are accepted by mail server.

>>>
>>> looks like zip file awaiting approval due to size
>>>

 -Krutika
>
> On Mon, Aug 29, 2016 at 2:31 PM, David Gossage <
> dgoss...@carouselchecks.com> wrote:
>
>> Centos 7 Gluster 3.8.3
>>
>> Brick1: ccgl1.gl.local:/gluster1/BRICK1/1
>> Brick2: ccgl2.gl.local:/gluster1/BRICK1/1
>> Brick3: ccgl4.gl.local:/gluster1/BRICK1/1
>> Options Reconfigured:
>> cluster.data-self-heal-algorithm: full
>> cluster.self-heal-daemon: on
>> cluster.locking-scheme: granular
>> features.shard-block-size: 64MB
>> features.shard: on
>> performance.readdir-ahead: on
>> storage.owner-uid: 36
>> storage.owner-gid: 36
>> performance.quick-read: off
>> performance.read-ahead: off
>> performance.io-cache: off
>> performance.stat-prefetch: on
>> cluster.eager-lock: enable
>> network.remote-dio: enable
>> cluster.quorum-type: auto
>> cluster.server-quorum-type: server
>> server.allow-insecure: on
>> cluster.self-heal-window-size: 1024
>> cluster.background-self-heal-count: 16
>> performance.strict-write-ordering: off
>> nfs.disable: on
>> nfs.addr-namelookup: off
>> nfs.enable-ino32: off
>> cluster.granular-entry-heal: on
>>
>> Friday did rolling upgrade from 3.8.3->3.8.3 no issues.
>> Following the steps detailed in previous recommendations, I began the
>> process of replacing and healing bricks one node at a time.
>>
>> 1) kill pid of brick
>> 2) reconfigure brick from raid6 to raid10
>> 3) recreate directory of brick
>> 4) gluster volume start <> force
>> 5) gluster volume heal <> full
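For reference, the five steps above map roughly onto the commands below. The volume name (GLUSTER1, as it appears in the brick logs later in the thread) and brick path are taken from this mail, the brick PID is a placeholder, and the RAID reshape itself is hardware-specific so it only appears as a comment:

    # 1) find and kill the brick process for the brick being replaced
    gluster volume status GLUSTER1        # note the PID of ccgl2.gl.local:/gluster1/BRICK1/1
    kill <brick-pid>
    # 2) rebuild the underlying storage (raid6 -> raid10) and remount it
    # 3) recreate the now-empty brick directory
    mkdir -p /gluster1/BRICK1/1
    # 4) restart the volume so the brick process comes back up
    gluster volume start GLUSTER1 force
    # 5) trigger a full heal
    gluster volume heal GLUSTER1 full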
>>
>> 1st node worked as expected, took 12 hours to heal 1TB of data.  Load was a
>> little heavy but nothing shocking.
>>
>> About an hour after node 1 finished, I began the same process on node2.  The
>> heal process kicked in as before and the files in directories visible from
>> the mount and .glusterfs healed in a short time.  Then it began the crawl of
>> .shard, adding those files to the heal count, at which point the entire
>> process basically ground to a halt.  After 48 hours, out of 19k shards it has
>> added 5900 to the heal list.  Load 

Re: [Gluster-users] 3.8.3 Shards Healing Glacier Slow

2016-08-29 Thread Krutika Dhananjay
OK. Do you also have granular-entry-heal on - just so that I can isolate
the problem area.
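For reference, the setting can be checked and, if needed, enabled from any node with something like the following (the volume name is a placeholder; the option shows up in volume info only if it has been explicitly reconfigured):

    gluster volume info <VOLNAME> | grep granular-entry-heal
    gluster volume set <VOLNAME> cluster.granular-entry-heal on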

-Krutika

On Tue, Aug 30, 2016 at 9:55 AM, Darrell Budic 
wrote:

> I noticed that my new brick (replacement disk) did not have a .shard
> directory created on the brick, if that helps.
>
> I removed the affected brick from the volume and then wiped the disk, did
> an add-brick, and everything healed right up. I didn’t try and set any
> attrs or anything else, just removed and added the brick as new.
>
> On Aug 29, 2016, at 9:49 AM, Darrell Budic  wrote:
>
> Just to let you know I’m seeing the same issue under 3.7.14 on CentOS 7.
> Some content was healed correctly, now all the shards are queued up in a
> heal list, but nothing is healing. Got similar brick errors logged to the
> ones David was getting on the brick that isn’t healing:
>
> [2016-08-29 03:31:40.436110] E [MSGID: 115050]
> [server-rpc-fops.c:179:server_lookup_cbk] 0-gv0-rep-server: 1613822:
> LOOKUP (null) 
> (----/0f61bf63-8ef1-4e53-8bc3-6d46590c4fb1.29)
> ==> (Invalid argument) [Invalid argument]
> [2016-08-29 03:31:43.005013] E [MSGID: 115050]
> [server-rpc-fops.c:179:server_lookup_cbk] 0-gv0-rep-server: 1616802:
> LOOKUP (null) 
> (----/0f61bf63-8ef1-4e53-8bc3-6d46590c4fb1.40)
> ==> (Invalid argument) [Invalid argument]
>
> This was after replacing the drive the brick was on and trying to get it
> back into the system by setting the volume's fattr on the brick dir. I’ll
>> try the suggested method here on it shortly.
>
>   -Darrell
>
>
> On Aug 29, 2016, at 7:25 AM, Krutika Dhananjay 
> wrote:
>
> Got it. Thanks.
>
> I tried the same test and shd crashed with SIGABRT (well, that's because I
> compiled from src with -DDEBUG).
> In any case, this error would prevent full heal from proceeding further.
> I'm debugging the crash now. Will let you know when I have the RC.
>
> -Krutika
>
> On Mon, Aug 29, 2016 at 5:47 PM, David Gossage <
> dgoss...@carouselchecks.com> wrote:
>
>>
>> On Mon, Aug 29, 2016 at 7:14 AM, David Gossage <
>> dgoss...@carouselchecks.com> wrote:
>>
>>> On Mon, Aug 29, 2016 at 5:25 AM, Krutika Dhananjay 
>>> wrote:
>>>
 Could you attach both client and brick logs? Meanwhile I will try these
 steps out on my machines and see if it is easily recreatable.


>>> Hoping 7z files are accepted by mail server.
>>>
>>
>> looks like zip file awaiting approval due to size
>>
>>>
>>> -Krutika

 On Mon, Aug 29, 2016 at 2:31 PM, David Gossage <
 dgoss...@carouselchecks.com> wrote:

> Centos 7 Gluster 3.8.3
>
> Brick1: ccgl1.gl.local:/gluster1/BRICK1/1
> Brick2: ccgl2.gl.local:/gluster1/BRICK1/1
> Brick3: ccgl4.gl.local:/gluster1/BRICK1/1
> Options Reconfigured:
> cluster.data-self-heal-algorithm: full
> cluster.self-heal-daemon: on
> cluster.locking-scheme: granular
> features.shard-block-size: 64MB
> features.shard: on
> performance.readdir-ahead: on
> storage.owner-uid: 36
> storage.owner-gid: 36
> performance.quick-read: off
> performance.read-ahead: off
> performance.io-cache: off
> performance.stat-prefetch: on
> cluster.eager-lock: enable
> network.remote-dio: enable
> cluster.quorum-type: auto
> cluster.server-quorum-type: server
> server.allow-insecure: on
> cluster.self-heal-window-size: 1024
> cluster.background-self-heal-count: 16
> performance.strict-write-ordering: off
> nfs.disable: on
> nfs.addr-namelookup: off
> nfs.enable-ino32: off
> cluster.granular-entry-heal: on
>
> Friday did rolling upgrade from 3.8.3->3.8.3 no issues.
> Following the steps detailed in previous recommendations, I began the
> process of replacing and healing bricks one node at a time.
>
> 1) kill pid of brick
> 2) reconfigure brick from raid6 to raid10
> 3) recreate directory of brick
> 4) gluster volume start <> force
> 5) gluster volume heal <> full
>
>> 1st node worked as expected, took 12 hours to heal 1TB of data.  Load was a
>> little heavy but nothing shocking.
>>
>> About an hour after node 1 finished, I began the same process on node2.  The
>> heal process kicked in as before and the files in directories visible from
>> the mount and .glusterfs healed in a short time.  Then it began the crawl of
>> .shard, adding those files to the heal count, at which point the entire
>> process basically ground to a halt.  After 48 hours, out of 19k shards it has
>> added 5900 to the heal list.  Load on all 3 machines is negligible.  It was
>> suggested to change cluster.data-self-heal-algorithm to full and restart the
>> volume, which I did.  No effect.  Tried relaunching the heal, no effect,
>> regardless of which node I picked.  I started each VM and performed a stat of
>> all files from within it, or a full virus scan 

Re: [Gluster-users] 3.8.3 Shards Healing Glacier Slow

2016-08-29 Thread Darrell Budic
I noticed that my new brick (replacement disk) did not have a .shard directory 
created on the brick, if that helps. 

I removed the affected brick from the volume and then wiped the disk, did an 
add-brick, and everything healed right up. I didn’t try and set any attrs or 
anything else, just removed and added the brick as new.
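A rough sketch of the kind of remove-brick/add-brick sequence described above, assuming a 3-way replica and placeholder volume, host and brick names:

    # drop the dead brick out of the replica set
    gluster volume remove-brick <VOLNAME> replica 2 <host>:/bricks/<VOLNAME>/brick1 force
    # wipe/recreate the filesystem on the replacement disk, then add the brick back
    mkdir -p /bricks/<VOLNAME>/brick1
    gluster volume add-brick <VOLNAME> replica 3 <host>:/bricks/<VOLNAME>/brick1
    # self-heal then repopulates the new brick; progress can be watched with
    gluster volume heal <VOLNAME> info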

> On Aug 29, 2016, at 9:49 AM, Darrell Budic  wrote:
> 
> Just to let you know I’m seeing the same issue under 3.7.14 on CentOS 7. Some 
> content was healed correctly, now all the shards are queued up in a heal 
> list, but nothing is healing. Got similar brick errors logged to the ones 
> David was getting on the brick that isn’t healing:
> 
> [2016-08-29 03:31:40.436110] E [MSGID: 115050] 
> [server-rpc-fops.c:179:server_lookup_cbk] 0-gv0-rep-server: 1613822: LOOKUP 
> (null) 
> (----/0f61bf63-8ef1-4e53-8bc3-6d46590c4fb1.29)
>  ==> (Invalid argument) [Invalid argument]
> [2016-08-29 03:31:43.005013] E [MSGID: 115050] 
> [server-rpc-fops.c:179:server_lookup_cbk] 0-gv0-rep-server: 1616802: LOOKUP 
> (null) 
> (----/0f61bf63-8ef1-4e53-8bc3-6d46590c4fb1.40)
>  ==> (Invalid argument) [Invalid argument]
> 
> This was after replacing the drive the brick was on and trying to get it back 
> into the system by setting the volume's fattr on the brick dir. I’ll try the 
> suggested method here on it shortly.
> 
>   -Darrell
> 
> 
>> On Aug 29, 2016, at 7:25 AM, Krutika Dhananjay  wrote:
>> 
>> Got it. Thanks.
>> 
>> I tried the same test and shd crashed with SIGABRT (well, that's because I 
>> compiled from src with -DDEBUG).
>> In any case, this error would prevent full heal from proceeding further.
>> I'm debugging the crash now. Will let you know when I have the RC.
>> 
>> -Krutika
>> 
>> On Mon, Aug 29, 2016 at 5:47 PM, David Gossage  wrote:
>> 
>> On Mon, Aug 29, 2016 at 7:14 AM, David Gossage  wrote:
>> On Mon, Aug 29, 2016 at 5:25 AM, Krutika Dhananjay  wrote:
>> Could you attach both client and brick logs? Meanwhile I will try these 
>> steps out on my machines and see if it is easily recreatable.
>> 
>> 
>> Hoping 7z files are accepted by mail server.
>> 
>> looks like zip file awaiting approval due to size 
>> 
>> -Krutika
>> 
>> On Mon, Aug 29, 2016 at 2:31 PM, David Gossage  wrote:
>> Centos 7 Gluster 3.8.3
>> 
>> Brick1: ccgl1.gl.local:/gluster1/BRICK1/1
>> Brick2: ccgl2.gl.local:/gluster1/BRICK1/1
>> Brick3: ccgl4.gl.local:/gluster1/BRICK1/1
>> Options Reconfigured:
>> cluster.data-self-heal-algorithm: full
>> cluster.self-heal-daemon: on
>> cluster.locking-scheme: granular
>> features.shard-block-size: 64MB
>> features.shard: on
>> performance.readdir-ahead: on
>> storage.owner-uid: 36
>> storage.owner-gid: 36
>> performance.quick-read: off
>> performance.read-ahead: off
>> performance.io-cache: off
>> performance.stat-prefetch: on
>> cluster.eager-lock: enable
>> network.remote-dio: enable
>> cluster.quorum-type: auto
>> cluster.server-quorum-type: server
>> server.allow-insecure: on
>> cluster.self-heal-window-size: 1024
>> cluster.background-self-heal-count: 16
>> performance.strict-write-ordering: off
>> nfs.disable: on
>> nfs.addr-namelookup: off
>> nfs.enable-ino32: off
>> cluster.granular-entry-heal: on
>> 
>> Friday did rolling upgrade from 3.8.3->3.8.3 no issues.
>> Following the steps detailed in previous recommendations, I began the
>> process of replacing and healing bricks one node at a time.
>> 
>> 1) kill pid of brick
>> 2) reconfigure brick from raid6 to raid10
>> 3) recreate directory of brick
>> 4) gluster volume start <> force
>> 5) gluster volume heal <> full
>> 
>> 1st node worked as expected, took 12 hours to heal 1TB of data.  Load was a
>> little heavy but nothing shocking.
>>
>> About an hour after node 1 finished, I began the same process on node2.  The
>> heal process kicked in as before and the files in directories visible from
>> the mount and .glusterfs healed in a short time.  Then it began the crawl of
>> .shard, adding those files to the heal count, at which point the entire
>> process basically ground to a halt.  After 48 hours, out of 19k shards it has
>> added 5900 to the heal list.  Load on all 3 machines is negligible.  It was
>> suggested to change cluster.data-self-heal-algorithm to full and restart the
>> volume, which I did.  No effect.  Tried relaunching the heal, no effect,
>> regardless of which node I picked.  I started each VM and performed a stat of
>> all files from within it, or a full virus scan, and that seemed to cause
>> short small spikes in shards added, but not by much.  Logs are showing no
>> real messages indicating anything is going 

Re: [Gluster-users] incorrect usage value on a directory

2016-08-29 Thread Sergei Gerasenko
I found an informative thread on a similar problem:

http://www.spinics.net/lists/gluster-devel/msg18400.html

According to the thread, it seems that the solution is to disable the
quota, which will clear the relevant xattrs and then re-enable the quota
which should force a recalc. I will try this tomorrow.
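In case it is useful to anyone hitting the same thing, that amounts to roughly the following (volume name, directory and limit are placeholders; note that disabling quota drops the configured limits, so they have to be re-applied afterwards):

    gluster volume quota <VOLNAME> disable
    gluster volume quota <VOLNAME> enable
    # re-apply the limits, which kicks off a fresh accounting crawl
    gluster volume quota <VOLNAME> limit-usage /some/dir 3TB
    gluster volume quota <VOLNAME> list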

On Thu, Aug 11, 2016 at 9:31 AM, Sergei Gerasenko  wrote:

> Hi Selvaganesh,
>
> Thanks so much for your help. I didn’t have that option on probably
> because I originally had a lower version of gluster and then upgraded. I
> turned the option on just now.
>
> The usage is still off. Should I wait a certain time?
>
> Thanks,
>   Sergei
>
> On Aug 9, 2016, at 7:26 AM, Manikandan Selvaganesh 
> wrote:
>
> Hi Sergei,
>
> When quota is enabled, quota-deem-statfs should be set to ON (by default
> in recent versions). But apparently, from your 'gluster v info' output, it
> looks like quota-deem-statfs is not on.
>
> Could you please check and confirm the same on
> /var/lib/glusterd/vols//info. If you do not find an option
> 'features.quota-deem-statfs=on', then this feature is turned off. Did you
> turn off this one? You could turn it on by doing this
> 'gluster volume set  quota-deem-statfs on'.
>
> To know more about this feature, please refer here[1]
>
> [1] https://gluster.readthedocs.io/en/latest/Administrator%20Guide/
> Directory%20Quota/
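In shell terms the check and fix described above look roughly like this (volume name and mount point are placeholders):

    # confirm whether the option is recorded for the volume
    grep quota-deem-statfs /var/lib/glusterd/vols/<VOLNAME>/info
    # enable it so df on the mount reflects quota usage
    gluster volume set <VOLNAME> quota-deem-statfs on
    df -h /mnt/<VOLNAME>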
>
>
> On Tue, Aug 9, 2016 at 5:43 PM, Sergei Gerasenko 
> wrote:
>
>> Hi ,
>>
>> The gluster version is 3.7.12. Here’s the output of `gluster info`:
>>
>> Volume Name: ftp_volume
>> Type: Distributed-Replicate
>> Volume ID: SOME_VOLUME_ID
>> Status: Started
>> Number of Bricks: 3 x 2 = 6
>> Transport-type: tcp
>> Bricks:
>> Brick1: host03:/data/ftp_gluster_brick
>> Brick2: host04:/data/ftp_gluster_brick
>> Brick3: host05:/data/ftp_gluster_brick
>> Brick4: host06:/data/ftp_gluster_brick
>> Brick5: host07:/data/ftp_gluster_brick
>> Brick6: host08:/data/ftp_gluster_brick
>> Options Reconfigured:
>> features.quota: on
>>
>> Thanks for the reply!! I thought nobody would reply at this point :)
>>
>> Sergei
>>
>> On Aug 9, 2016, at 6:03 AM, Manikandan Selvaganesh 
>> wrote:
>>
>> Hi,
>>
>> Sorry, I missed the mail. May I know which version of gluster you are
>> using and please paste the output of
>> gluster v info?
>>
>> On Sat, Aug 6, 2016 at 8:19 AM, Sergei Gerasenko 
>> wrote:
>>
>>> Hi,
>>>
>>> I'm playing with quotas and the quota list command on one of the
>>> directories claims it uses 3T, whereas the du command says only 512G is
>>> used.
>>>
>>> Anything I can do to force a re-calc, re-crawl, etc?
>>>
>>> Thanks,
>>>  Sergei
>>>
>>> ___
>>> Gluster-users mailing list
>>> Gluster-users@gluster.org
>>> http://www.gluster.org/mailman/listinfo/gluster-users
>>>
>>
>>
>>
>> --
>> Regards,
>> Manikandan Selvaganesh.
>>
>>
>>
>
>
> --
> Regards,
> Manikandan Selvaganesh.
>
>
>
___
Gluster-users mailing list
Gluster-users@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-users

Re: [Gluster-users] 3.8.3 Shards Healing Glacier Slow

2016-08-29 Thread David Gossage
On Mon, Aug 29, 2016 at 7:01 AM, Anuradha Talur  wrote:

>
>
> - Original Message -
> > From: "David Gossage" 
> > To: "Anuradha Talur" 
> > Cc: "gluster-users@gluster.org List" ,
> "Krutika Dhananjay" 
> > Sent: Monday, August 29, 2016 5:12:42 PM
> > Subject: Re: [Gluster-users] 3.8.3 Shards Healing Glacier Slow
> >
> > On Mon, Aug 29, 2016 at 5:39 AM, Anuradha Talur 
> wrote:
> >
> > > Response inline.
> > >
> > > - Original Message -
> > > > From: "Krutika Dhananjay" 
> > > > To: "David Gossage" 
> > > > Cc: "gluster-users@gluster.org List" 
> > > > Sent: Monday, August 29, 2016 3:55:04 PM
> > > > Subject: Re: [Gluster-users] 3.8.3 Shards Healing Glacier Slow
> > > >
> > > > Could you attach both client and brick logs? Meanwhile I will try
> these
> > > steps
> > > > out on my machines and see if it is easily recreatable.
> > > >
> > > > -Krutika
> > > >
> > > > On Mon, Aug 29, 2016 at 2:31 PM, David Gossage <
> > > dgoss...@carouselchecks.com
> > > > > wrote:
> > > >
> > > >
> > > >
> > > > Centos 7 Gluster 3.8.3
> > > >
> > > > Brick1: ccgl1.gl.local:/gluster1/BRICK1/1
> > > > Brick2: ccgl2.gl.local:/gluster1/BRICK1/1
> > > > Brick3: ccgl4.gl.local:/gluster1/BRICK1/1
> > > > Options Reconfigured:
> > > > cluster.data-self-heal-algorithm: full
> > > > cluster.self-heal-daemon: on
> > > > cluster.locking-scheme: granular
> > > > features.shard-block-size: 64MB
> > > > features.shard: on
> > > > performance.readdir-ahead: on
> > > > storage.owner-uid: 36
> > > > storage.owner-gid: 36
> > > > performance.quick-read: off
> > > > performance.read-ahead: off
> > > > performance.io-cache: off
> > > > performance.stat-prefetch: on
> > > > cluster.eager-lock: enable
> > > > network.remote-dio: enable
> > > > cluster.quorum-type: auto
> > > > cluster.server-quorum-type: server
> > > > server.allow-insecure: on
> > > > cluster.self-heal-window-size: 1024
> > > > cluster.background-self-heal-count: 16
> > > > performance.strict-write-ordering: off
> > > > nfs.disable: on
> > > > nfs.addr-namelookup: off
> > > > nfs.enable-ino32: off
> > > > cluster.granular-entry-heal: on
> > > >
> > > > Friday did rolling upgrade from 3.8.3->3.8.3 no issues.
> > > > Following the steps detailed in previous recommendations, I began the
> > > > process of replacing and healing bricks one node at a time.
> > > >
> > > > 1) kill pid of brick
> > > > 2) reconfigure brick from raid6 to raid10
> > > > 3) recreate directory of brick
> > > > 4) gluster volume start <> force
> > > > 5) gluster volume heal <> full
> > > Hi,
> > >
> > > I'd suggest that full heal is not used. There are a few bugs in full
> heal.
> > > Better safe than sorry ;)
> > > Instead I'd suggest the following steps:
> > >
> > > Currently I brought the node down by systemctl stop glusterd as I was
> > getting sporadic io issues and a few VM's paused so hoping that will
> help.
> > I may wait to do this till around 4PM when most work is done in case it
> > shoots load up.
> >
> >
> > > 1) kill pid of brick
> > > 2) do whatever configuring of the brick you need
> > > 3) recreate brick dir
> > > 4) while the brick is still down, from the mount point:
> > >a) create a dummy non existent dir under / of mount.
> > >
> >
> > so if node 2 is the down brick, do I pick another node, for example 3, and
> > make a test dir under its brick directory that doesn't exist on 2, or
> > should I be doing this over a gluster mount?
> You should be doing this over gluster mount.
> >
> > >b) set a non existent extended attribute on / of mount.
> > >
> >
> > Could you give me an example of an attribute to set?   I've read a tad on
> > this, and looked up attributes but haven't set any yet myself.
> >
> Sure. setfattr -n "user.some-name" -v "some-value" 
> > Doing these steps will ensure that heal happens only from updated brick
> to
> > > down brick.
> > > 5) gluster v start <> force
> > > 6) gluster v heal <>
> > >
> >
> > Will it matter if somewhere in gluster the full heal command was run
> other
> > day?  Not sure if it eventually stops or times out.
> >
> Full heal will stop once the crawl is done. So if you want to trigger heal
> again, run gluster v heal <>. Actually, even bringing the brick up or volume
> start force should trigger the heal.
>

Did this on the test bed today.  It's one server with 3 bricks on the same
machine, so take that for what it's worth.  Also, it still runs 3.8.2.  Maybe
I'll update and re-run the test.

killed brick
deleted brick dir
recreated brick dir
created fake dir on gluster mount
set suggested fake attribute on it
ran volume start <> force

Looked at the files it said needed healing, and it was just 8 shards that had
been modified during the few minutes I was running through the steps.

Gave it a few minutes and it stayed the same.
Ran gluster volume <> heal.

It healed all the directories and files you can see over the mount 

Re: [Gluster-users] 3.8.3 Shards Healing Glacier Slow

2016-08-29 Thread Darrell Budic
Just to let you know I’m seeing the same issue under 3.7.14 on CentOS 7. Some 
content was healed correctly, now all the shards are queued up in a heal list, 
but nothing is healing. Got similar brick errors logged to the ones David was 
getting on the brick that isn’t healing:

[2016-08-29 03:31:40.436110] E [MSGID: 115050] 
[server-rpc-fops.c:179:server_lookup_cbk] 0-gv0-rep-server: 1613822: LOOKUP 
(null) 
(----/0f61bf63-8ef1-4e53-8bc3-6d46590c4fb1.29) 
==> (Invalid argument) [Invalid argument]
[2016-08-29 03:31:43.005013] E [MSGID: 115050] 
[server-rpc-fops.c:179:server_lookup_cbk] 0-gv0-rep-server: 1616802: LOOKUP 
(null) 
(----/0f61bf63-8ef1-4e53-8bc3-6d46590c4fb1.40) 
==> (Invalid argument) [Invalid argument]

This was after replacing the drive the brick was on and trying to get it back 
into the system by setting the volume's fattr on the brick dir. I’ll try the 
suggested method here on it shortly.

  -Darrell
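For context, "setting the volume's fattr on the brick dir" above refers to the usual trick of stamping the volume-id extended attribute onto a recreated brick directory so glusterd will start the brick there again; a rough sketch, with placeholder brick paths:

    # read the volume-id from a surviving brick on another node
    getfattr -n trusted.glusterfs.volume-id -e hex /path/to/healthy/brick
    # write the same value onto the freshly created brick directory
    setfattr -n trusted.glusterfs.volume-id -v 0x<value-from-above> /path/to/new/brick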


> On Aug 29, 2016, at 7:25 AM, Krutika Dhananjay  wrote:
> 
> Got it. Thanks.
> 
> I tried the same test and shd crashed with SIGABRT (well, that's because I 
> compiled from src with -DDEBUG).
> In any case, this error would prevent full heal from proceeding further.
> I'm debugging the crash now. Will let you know when I have the RC.
> 
> -Krutika
> 
> On Mon, Aug 29, 2016 at 5:47 PM, David Gossage  wrote:
> 
> On Mon, Aug 29, 2016 at 7:14 AM, David Gossage  wrote:
> On Mon, Aug 29, 2016 at 5:25 AM, Krutika Dhananjay  wrote:
> Could you attach both client and brick logs? Meanwhile I will try these steps 
> out on my machines and see if it is easily recreatable.
> 
> 
> Hoping 7z files are accepted by mail server.
> 
> looks like zip file awaiting approval due to size 
> 
> -Krutika
> 
> On Mon, Aug 29, 2016 at 2:31 PM, David Gossage  wrote:
> Centos 7 Gluster 3.8.3
> 
> Brick1: ccgl1.gl.local:/gluster1/BRICK1/1
> Brick2: ccgl2.gl.local:/gluster1/BRICK1/1
> Brick3: ccgl4.gl.local:/gluster1/BRICK1/1
> Options Reconfigured:
> cluster.data-self-heal-algorithm: full
> cluster.self-heal-daemon: on
> cluster.locking-scheme: granular
> features.shard-block-size: 64MB
> features.shard: on
> performance.readdir-ahead: on
> storage.owner-uid: 36
> storage.owner-gid: 36
> performance.quick-read: off
> performance.read-ahead: off
> performance.io-cache: off
> performance.stat-prefetch: on
> cluster.eager-lock: enable
> network.remote-dio: enable
> cluster.quorum-type: auto
> cluster.server-quorum-type: server
> server.allow-insecure: on
> cluster.self-heal-window-size: 1024
> cluster.background-self-heal-count: 16
> performance.strict-write-ordering: off
> nfs.disable: on
> nfs.addr-namelookup: off
> nfs.enable-ino32: off
> cluster.granular-entry-heal: on
> 
> Friday did rolling upgrade from 3.8.3->3.8.3 no issues.
> Following the steps detailed in previous recommendations, I began the
> process of replacing and healing bricks one node at a time.
> 
> 1) kill pid of brick
> 2) reconfigure brick from raid6 to raid10
> 3) recreate directory of brick
> 4) gluster volume start <> force
> 5) gluster volume heal <> full
> 
> 1st node worked as expected, took 12 hours to heal 1TB of data.  Load was a
> little heavy but nothing shocking.
>
> About an hour after node 1 finished, I began the same process on node2.  The
> heal process kicked in as before and the files in directories visible from
> the mount and .glusterfs healed in a short time.  Then it began the crawl of
> .shard, adding those files to the heal count, at which point the entire
> process basically ground to a halt.  After 48 hours, out of 19k shards it has
> added 5900 to the heal list.  Load on all 3 machines is negligible.  It was
> suggested to change cluster.data-self-heal-algorithm to full and restart the
> volume, which I did.  No effect.  Tried relaunching the heal, no effect,
> regardless of which node I picked.  I started each VM and performed a stat of
> all files from within it, or a full virus scan, and that seemed to cause
> short small spikes in shards added, but not by much.  Logs are showing no
> real messages indicating anything is going on.  I get hits in the brick log
> on occasion for null lookups, making me think it's not really crawling the
> shards directory but waiting for a shard lookup to add it.  I'll get the
> following in the brick log, but not constantly, and sometimes multiple
> entries for the same shard.
> 
> [2016-08-29 08:31:57.478125] W [MSGID: 115009] 
> [server-resolve.c:569:server_resolve] 0-GLUSTER1-server: no resolution type 
> for (null) (LOOKUP)
> [2016-08-29 08:31:57.478170] E [MSGID: 115050] 
> [server-rpc-fops.c:156:server_lookup_cbk] 0-GLUSTER1-server: 12591783: LOOKUP 
> (null) (---00
> 

Re: [Gluster-users] [Gluster-devel] CFP for Gluster Developer Summit

2016-08-29 Thread Prashanth Pai
Hi all,

Proposing the following:

Title: Object Storage with Gluster

Agenda
* Why object storage ?
* Swift API and Amazon S3 API
* Using swift to provide object interface to Gluster volume
* Object operations
* Demo

 -Prashanth Pai

- Original Message -
> From: "Raghavendra G" 
> To: "Arthy Loganathan" 
> Cc: "Amye Scavarda" , "gluster-users Discussion List" 
> , "Gluster
> Devel" 
> Sent: Monday, 29 August, 2016 12:14:26 PM
> Subject: Re: [Gluster-devel] [Gluster-users] CFP for Gluster Developer Summit
> 
> Though its a bit late, here is one from me:
> 
> Topic: "DHT: current design, (dis)advantages, challenges - A perspective"
> 
> Agenda:
> 
> I'll try to address
> * the why's, (dis)advantages of current design. As noted in the title, this
> is my own perspective I've gathered while working on DHT. We don't have any
> existing documentation for the motivations. The source has been bugs (huge
> number of them :)), interaction with other people working on DHT and code
> reading.
> * Current work going on and a rough roadmap of what we'll be working on
> during at least next few months.
> * Going by the objectives of this talk, this might as well turn out to be a
> discussion.
> 
> regards,
> 
> On Wed, Aug 24, 2016 at 8:27 PM, Arthy Loganathan < aloga...@redhat.com >
> wrote:
> 
> 
> 
> 
> 
> 
> 
> On 08/24/2016 07:18 PM, Atin Mukherjee wrote:
> 
> 
> 
> 
> 
> On Wed, Aug 24, 2016 at 5:43 PM, Arthy Loganathan < aloga...@redhat.com >
> wrote:
> 
> 
> Hi,
> 
> I would like to propose below topic as a lightening talk.
> 
> Title: Data Logging to monitor Gluster Performance
> 
> Theme: Process and Infrastructure
> 
> To benchmark any software product, we often need to do performance analysis
> of the system along with the product. I have written a tool, "System Monitor",
> to periodically collect data such as CPU usage, memory usage and load average
> (with graphical representation) for any process on a system. The collected
> data can help in analyzing the system & product performance.
> 
> A link to this project would definitely help here.
> 
> Hi Atin,
> 
> Here is the link to the project - https://github.com/aloganat/system_monitor
> 
> Thanks & Regards,
> Arthy
> 
> 
> 
> 
> 
> 
> 
> From this talk I would like to give an overview of this tool and explain how
> it can be used to monitor Gluster performance.
> 
> Agenda:
> - Overview of the tool and its usage
> - Collecting the data in an excel sheet at regular intervals of time
> - Plotting the graph with that data (in progress)
> - a short demo
> 
> Thanks & Regards,
> 
> Arthy
> 
> 
> 
> 
> ___
> Gluster-users mailing list
> Gluster-users@gluster.org
> http://www.gluster.org/mailman/listinfo/gluster-users
> 
> 
> 
> --
> 
> --Atin
> 
> 
> ___
> Gluster-devel mailing list
> gluster-de...@gluster.org
> http://www.gluster.org/mailman/listinfo/gluster-devel
> 
> 
> 
> --
> Raghavendra G
> 
> ___
> Gluster-devel mailing list
> gluster-de...@gluster.org
> http://www.gluster.org/mailman/listinfo/gluster-devel
___
Gluster-users mailing list
Gluster-users@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-users


Re: [Gluster-users] 3.8.3 Shards Healing Glacier Slow

2016-08-29 Thread Krutika Dhananjay
Got it. Thanks.

I tried the same test and shd crashed with SIGABRT (well, that's because I
compiled from src with -DDEBUG).
In any case, this error would prevent full heal from proceeding further.
I'm debugging the crash now. Will let you know when I have the RC.

-Krutika

On Mon, Aug 29, 2016 at 5:47 PM, David Gossage 
wrote:

>
> On Mon, Aug 29, 2016 at 7:14 AM, David Gossage <
> dgoss...@carouselchecks.com> wrote:
>
>> On Mon, Aug 29, 2016 at 5:25 AM, Krutika Dhananjay 
>> wrote:
>>
>>> Could you attach both client and brick logs? Meanwhile I will try these
>>> steps out on my machines and see if it is easily recreatable.
>>>
>>>
>> Hoping 7z files are accepted by mail server.
>>
>
> looks like zip file awaiting approval due to size
>
>>
>> -Krutika
>>>
>>> On Mon, Aug 29, 2016 at 2:31 PM, David Gossage <
>>> dgoss...@carouselchecks.com> wrote:
>>>
 Centos 7 Gluster 3.8.3

 Brick1: ccgl1.gl.local:/gluster1/BRICK1/1
 Brick2: ccgl2.gl.local:/gluster1/BRICK1/1
 Brick3: ccgl4.gl.local:/gluster1/BRICK1/1
 Options Reconfigured:
 cluster.data-self-heal-algorithm: full
 cluster.self-heal-daemon: on
 cluster.locking-scheme: granular
 features.shard-block-size: 64MB
 features.shard: on
 performance.readdir-ahead: on
 storage.owner-uid: 36
 storage.owner-gid: 36
 performance.quick-read: off
 performance.read-ahead: off
 performance.io-cache: off
 performance.stat-prefetch: on
 cluster.eager-lock: enable
 network.remote-dio: enable
 cluster.quorum-type: auto
 cluster.server-quorum-type: server
 server.allow-insecure: on
 cluster.self-heal-window-size: 1024
 cluster.background-self-heal-count: 16
 performance.strict-write-ordering: off
 nfs.disable: on
 nfs.addr-namelookup: off
 nfs.enable-ino32: off
 cluster.granular-entry-heal: on

 Friday did rolling upgrade from 3.8.3->3.8.3 no issues.
 Following the steps detailed in previous recommendations, I began the
 process of replacing and healing bricks one node at a time.

 1) kill pid of brick
 2) reconfigure brick from raid6 to raid10
 3) recreate directory of brick
 4) gluster volume start <> force
 5) gluster volume heal <> full

 1st node worked as expected, took 12 hours to heal 1TB of data.  Load was a
 little heavy but nothing shocking.

 About an hour after node 1 finished, I began the same process on node2.  The
 heal process kicked in as before and the files in directories visible from
 the mount and .glusterfs healed in a short time.  Then it began the crawl of
 .shard, adding those files to the heal count, at which point the entire
 process basically ground to a halt.  After 48 hours, out of 19k shards it has
 added 5900 to the heal list.  Load on all 3 machines is negligible.  It was
 suggested to change cluster.data-self-heal-algorithm to full and restart the
 volume, which I did.  No effect.  Tried relaunching the heal, no effect,
 regardless of which node I picked.  I started each VM and performed a stat of
 all files from within it, or a full virus scan, and that seemed to cause
 short small spikes in shards added, but not by much.  Logs are showing no
 real messages indicating anything is going on.  I get hits in the brick log
 on occasion for null lookups, making me think it's not really crawling the
 shards directory but waiting for a shard lookup to add it.  I'll get the
 following in the brick log, but not constantly, and sometimes multiple
 entries for the same shard.

 [2016-08-29 08:31:57.478125] W [MSGID: 115009]
 [server-resolve.c:569:server_resolve] 0-GLUSTER1-server: no resolution
 type for (null) (LOOKUP)
 [2016-08-29 08:31:57.478170] E [MSGID: 115050]
 [server-rpc-fops.c:156:server_lookup_cbk] 0-GLUSTER1-server: 12591783:
 LOOKUP (null) (---00
 00-/241a55ed-f0d5-4dbc-a6ce-ab784a0ba6ff.221) ==> (Invalid
 argument) [Invalid argument]

 This one repeated about 30 times in a row, then nothing for 10 minutes, then
 one hit for a different shard by itself.

 How can I determine if the heal is actually running?  How can I kill it or
 force a restart?  Does the node I start it from determine which directory
 gets crawled to determine the heals?
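For reference, the usual ways to answer these questions from the CLI are along these lines (volume name is a placeholder):

    # list files/shards currently pending heal, per brick
    gluster volume heal <VOLNAME> info
    # counts only, which is less noisy with a large heal queue
    gluster volume heal <VOLNAME> statistics heal-count
    # history of self-heal crawls
    gluster volume heal <VOLNAME> statistics
    # toggling the self-heal daemon stops and restarts its crawls
    gluster volume set <VOLNAME> cluster.self-heal-daemon off
    gluster volume set <VOLNAME> cluster.self-heal-daemon on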

 *David Gossage*
 *Carousel Checks Inc. | System Administrator*
 *Office* 708.613.2284

 ___
 Gluster-users mailing list
 Gluster-users@gluster.org
 http://www.gluster.org/mailman/listinfo/gluster-users

>>>
>>>
>>
>
___
Gluster-users mailing list
Gluster-users@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-users

Re: [Gluster-users] 3.8.3 Shards Healing Glacier Slow

2016-08-29 Thread David Gossage
On Mon, Aug 29, 2016 at 7:14 AM, David Gossage 
wrote:

> On Mon, Aug 29, 2016 at 5:25 AM, Krutika Dhananjay 
> wrote:
>
>> Could you attach both client and brick logs? Meanwhile I will try these
>> steps out on my machines and see if it is easily recreatable.
>>
>>
> Hoping 7z files are accepted by mail server.
>

looks like zip file awaiting approval due to size

>
> -Krutika
>>
>> On Mon, Aug 29, 2016 at 2:31 PM, David Gossage <
>> dgoss...@carouselchecks.com> wrote:
>>
>>> Centos 7 Gluster 3.8.3
>>>
>>> Brick1: ccgl1.gl.local:/gluster1/BRICK1/1
>>> Brick2: ccgl2.gl.local:/gluster1/BRICK1/1
>>> Brick3: ccgl4.gl.local:/gluster1/BRICK1/1
>>> Options Reconfigured:
>>> cluster.data-self-heal-algorithm: full
>>> cluster.self-heal-daemon: on
>>> cluster.locking-scheme: granular
>>> features.shard-block-size: 64MB
>>> features.shard: on
>>> performance.readdir-ahead: on
>>> storage.owner-uid: 36
>>> storage.owner-gid: 36
>>> performance.quick-read: off
>>> performance.read-ahead: off
>>> performance.io-cache: off
>>> performance.stat-prefetch: on
>>> cluster.eager-lock: enable
>>> network.remote-dio: enable
>>> cluster.quorum-type: auto
>>> cluster.server-quorum-type: server
>>> server.allow-insecure: on
>>> cluster.self-heal-window-size: 1024
>>> cluster.background-self-heal-count: 16
>>> performance.strict-write-ordering: off
>>> nfs.disable: on
>>> nfs.addr-namelookup: off
>>> nfs.enable-ino32: off
>>> cluster.granular-entry-heal: on
>>>
>>> Friday did rolling upgrade from 3.8.3->3.8.3 no issues.
>>> Following the steps detailed in previous recommendations, I began the
>>> process of replacing and healing bricks one node at a time.
>>>
>>> 1) kill pid of brick
>>> 2) reconfigure brick from raid6 to raid10
>>> 3) recreate directory of brick
>>> 4) gluster volume start <> force
>>> 5) gluster volume heal <> full
>>>
>>> 1st node worked as expected, took 12 hours to heal 1TB of data.  Load was a
>>> little heavy but nothing shocking.
>>>
>>> About an hour after node 1 finished, I began the same process on node2.  The
>>> heal process kicked in as before and the files in directories visible from
>>> the mount and .glusterfs healed in a short time.  Then it began the crawl of
>>> .shard, adding those files to the heal count, at which point the entire
>>> process basically ground to a halt.  After 48 hours, out of 19k shards it has
>>> added 5900 to the heal list.  Load on all 3 machines is negligible.  It was
>>> suggested to change cluster.data-self-heal-algorithm to full and restart the
>>> volume, which I did.  No effect.  Tried relaunching the heal, no effect,
>>> regardless of which node I picked.  I started each VM and performed a stat of
>>> all files from within it, or a full virus scan, and that seemed to cause
>>> short small spikes in shards added, but not by much.  Logs are showing no
>>> real messages indicating anything is going on.  I get hits in the brick log
>>> on occasion for null lookups, making me think it's not really crawling the
>>> shards directory but waiting for a shard lookup to add it.  I'll get the
>>> following in the brick log, but not constantly, and sometimes multiple
>>> entries for the same shard.
>>>
>>> [2016-08-29 08:31:57.478125] W [MSGID: 115009]
>>> [server-resolve.c:569:server_resolve] 0-GLUSTER1-server: no resolution
>>> type for (null) (LOOKUP)
>>> [2016-08-29 08:31:57.478170] E [MSGID: 115050]
>>> [server-rpc-fops.c:156:server_lookup_cbk] 0-GLUSTER1-server: 12591783:
>>> LOOKUP (null) (---00
>>> 00-/241a55ed-f0d5-4dbc-a6ce-ab784a0ba6ff.221) ==> (Invalid
>>> argument) [Invalid argument]
>>>
>>> This one repeated about 30 times in a row, then nothing for 10 minutes,
>>> then one hit for a different shard by itself.
>>>
>>> How can I determine if the heal is actually running?  How can I kill it or
>>> force a restart?  Does the node I start it from determine which directory
>>> gets crawled to determine the heals?
>>>
>>> *David Gossage*
>>> *Carousel Checks Inc. | System Administrator*
>>> *Office* 708.613.2284
>>>
>>> ___
>>> Gluster-users mailing list
>>> Gluster-users@gluster.org
>>> http://www.gluster.org/mailman/listinfo/gluster-users
>>>
>>
>>
>
___
Gluster-users mailing list
Gluster-users@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-users

Re: [Gluster-users] 3.8.3 Shards Healing Glacier Slow

2016-08-29 Thread David Gossage
On Mon, Aug 29, 2016 at 7:14 AM, David Gossage 
wrote:

> On Mon, Aug 29, 2016 at 5:25 AM, Krutika Dhananjay 
> wrote:
>
>> Could you attach both client and brick logs? Meanwhile I will try these
>> steps out on my machines and see if it is easily recreatable.
>>
>>
> Hoping 7z files are accepted by mail server.
>

Also, I didn't do translation of timezones, but in CST I started the node 1
heal at 2016-08-26 20:26:42, and then the next morning I started the initial
node 2 heal at 2016-08-27 07:58:34.

>
> -Krutika
>>
>> On Mon, Aug 29, 2016 at 2:31 PM, David Gossage <
>> dgoss...@carouselchecks.com> wrote:
>>
>>> Centos 7 Gluster 3.8.3
>>>
>>> Brick1: ccgl1.gl.local:/gluster1/BRICK1/1
>>> Brick2: ccgl2.gl.local:/gluster1/BRICK1/1
>>> Brick3: ccgl4.gl.local:/gluster1/BRICK1/1
>>> Options Reconfigured:
>>> cluster.data-self-heal-algorithm: full
>>> cluster.self-heal-daemon: on
>>> cluster.locking-scheme: granular
>>> features.shard-block-size: 64MB
>>> features.shard: on
>>> performance.readdir-ahead: on
>>> storage.owner-uid: 36
>>> storage.owner-gid: 36
>>> performance.quick-read: off
>>> performance.read-ahead: off
>>> performance.io-cache: off
>>> performance.stat-prefetch: on
>>> cluster.eager-lock: enable
>>> network.remote-dio: enable
>>> cluster.quorum-type: auto
>>> cluster.server-quorum-type: server
>>> server.allow-insecure: on
>>> cluster.self-heal-window-size: 1024
>>> cluster.background-self-heal-count: 16
>>> performance.strict-write-ordering: off
>>> nfs.disable: on
>>> nfs.addr-namelookup: off
>>> nfs.enable-ino32: off
>>> cluster.granular-entry-heal: on
>>>
>>> Friday did rolling upgrade from 3.8.3->3.8.3 no issues.
>>> Following the steps detailed in previous recommendations, I began the
>>> process of replacing and healing bricks one node at a time.
>>>
>>> 1) kill pid of brick
>>> 2) reconfigure brick from raid6 to raid10
>>> 3) recreate directory of brick
>>> 4) gluster volume start <> force
>>> 5) gluster volume heal <> full
>>>
>>> 1st node worked as expected, took 12 hours to heal 1TB of data.  Load was a
>>> little heavy but nothing shocking.
>>>
>>> About an hour after node 1 finished, I began the same process on node2.  The
>>> heal process kicked in as before and the files in directories visible from
>>> the mount and .glusterfs healed in a short time.  Then it began the crawl of
>>> .shard, adding those files to the heal count, at which point the entire
>>> process basically ground to a halt.  After 48 hours, out of 19k shards it has
>>> added 5900 to the heal list.  Load on all 3 machines is negligible.  It was
>>> suggested to change cluster.data-self-heal-algorithm to full and restart the
>>> volume, which I did.  No effect.  Tried relaunching the heal, no effect,
>>> regardless of which node I picked.  I started each VM and performed a stat of
>>> all files from within it, or a full virus scan, and that seemed to cause
>>> short small spikes in shards added, but not by much.  Logs are showing no
>>> real messages indicating anything is going on.  I get hits in the brick log
>>> on occasion for null lookups, making me think it's not really crawling the
>>> shards directory but waiting for a shard lookup to add it.  I'll get the
>>> following in the brick log, but not constantly, and sometimes multiple
>>> entries for the same shard.
>>>
>>> [2016-08-29 08:31:57.478125] W [MSGID: 115009]
>>> [server-resolve.c:569:server_resolve] 0-GLUSTER1-server: no resolution
>>> type for (null) (LOOKUP)
>>> [2016-08-29 08:31:57.478170] E [MSGID: 115050]
>>> [server-rpc-fops.c:156:server_lookup_cbk] 0-GLUSTER1-server: 12591783:
>>> LOOKUP (null) (---00
>>> 00-/241a55ed-f0d5-4dbc-a6ce-ab784a0ba6ff.221) ==> (Invalid
>>> argument) [Invalid argument]
>>>
>>> This one repeated about 30 times in a row, then nothing for 10 minutes,
>>> then one hit for a different shard by itself.
>>>
>>> How can I determine if the heal is actually running?  How can I kill it or
>>> force a restart?  Does the node I start it from determine which directory
>>> gets crawled to determine the heals?
>>>
>>> *David Gossage*
>>> *Carousel Checks Inc. | System Administrator*
>>> *Office* 708.613.2284
>>>
>>> ___
>>> Gluster-users mailing list
>>> Gluster-users@gluster.org
>>> http://www.gluster.org/mailman/listinfo/gluster-users
>>>
>>
>>
>
___
Gluster-users mailing list
Gluster-users@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-users

Re: [Gluster-users] 3.8.3 Shards Healing Glacier Slow

2016-08-29 Thread David Gossage
On Mon, Aug 29, 2016 at 7:01 AM, Anuradha Talur  wrote:

>
>
> - Original Message -
> > From: "David Gossage" 
> > To: "Anuradha Talur" 
> > Cc: "gluster-users@gluster.org List" ,
> "Krutika Dhananjay" 
> > Sent: Monday, August 29, 2016 5:12:42 PM
> > Subject: Re: [Gluster-users] 3.8.3 Shards Healing Glacier Slow
> >
> > On Mon, Aug 29, 2016 at 5:39 AM, Anuradha Talur 
> wrote:
> >
> > > Response inline.
> > >
> > > - Original Message -
> > > > From: "Krutika Dhananjay" 
> > > > To: "David Gossage" 
> > > > Cc: "gluster-users@gluster.org List" 
> > > > Sent: Monday, August 29, 2016 3:55:04 PM
> > > > Subject: Re: [Gluster-users] 3.8.3 Shards Healing Glacier Slow
> > > >
> > > > Could you attach both client and brick logs? Meanwhile I will try
> these
> > > steps
> > > > out on my machines and see if it is easily recreatable.
> > > >
> > > > -Krutika
> > > >
> > > > On Mon, Aug 29, 2016 at 2:31 PM, David Gossage <
> > > dgoss...@carouselchecks.com
> > > > > wrote:
> > > >
> > > >
> > > >
> > > > Centos 7 Gluster 3.8.3
> > > >
> > > > Brick1: ccgl1.gl.local:/gluster1/BRICK1/1
> > > > Brick2: ccgl2.gl.local:/gluster1/BRICK1/1
> > > > Brick3: ccgl4.gl.local:/gluster1/BRICK1/1
> > > > Options Reconfigured:
> > > > cluster.data-self-heal-algorithm: full
> > > > cluster.self-heal-daemon: on
> > > > cluster.locking-scheme: granular
> > > > features.shard-block-size: 64MB
> > > > features.shard: on
> > > > performance.readdir-ahead: on
> > > > storage.owner-uid: 36
> > > > storage.owner-gid: 36
> > > > performance.quick-read: off
> > > > performance.read-ahead: off
> > > > performance.io-cache: off
> > > > performance.stat-prefetch: on
> > > > cluster.eager-lock: enable
> > > > network.remote-dio: enable
> > > > cluster.quorum-type: auto
> > > > cluster.server-quorum-type: server
> > > > server.allow-insecure: on
> > > > cluster.self-heal-window-size: 1024
> > > > cluster.background-self-heal-count: 16
> > > > performance.strict-write-ordering: off
> > > > nfs.disable: on
> > > > nfs.addr-namelookup: off
> > > > nfs.enable-ino32: off
> > > > cluster.granular-entry-heal: on
> > > >
> > > > Friday did rolling upgrade from 3.8.3->3.8.3 no issues.
> > > > Following the steps detailed in previous recommendations, I began the
> > > > process of replacing and healing bricks one node at a time.
> > > >
> > > > 1) kill pid of brick
> > > > 2) reconfigure brick from raid6 to raid10
> > > > 3) recreate directory of brick
> > > > 4) gluster volume start <> force
> > > > 5) gluster volume heal <> full
> > > Hi,
> > >
> > > I'd suggest that full heal is not used. There are a few bugs in full
> heal.
> > > Better safe than sorry ;)
> > > Instead I'd suggest the following steps:
> > >
> > > Currently I brought the node down by systemctl stop glusterd as I was
> > getting sporadic io issues and a few VM's paused so hoping that will
> help.
> > I may wait to do this till around 4PM when most work is done in case it
> > shoots load up.
> >
> >
> > > 1) kill pid of brick
> > > 2) do whatever configuring of the brick you need
> > > 3) recreate brick dir
> > > 4) while the brick is still down, from the mount point:
> > >a) create a dummy non existent dir under / of mount.
> > >
> >
> > so if node 2 is the down brick, do I pick another node, for example 3, and
> > make a test dir under its brick directory that doesn't exist on 2, or
> > should I be doing this over a gluster mount?
> You should be doing this over gluster mount.
> >
> > >b) set a non existent extended attribute on / of mount.
> > >
> >
> > Could you give me an example of an attribute to set?   I've read a tad on
> > this, and looked up attributes but haven't set any yet myself.
> >
> Sure. setfattr -n "user.some-name" -v "some-value" 
>

And that can be done over the gluster mount as well?  And if not, and it's
done on the brick, would it need to be done from both nodes that are up?


> > Doing these steps will ensure that heal happens only from updated brick
> to
> > > down brick.
> > > 5) gluster v start <> force
> > > 6) gluster v heal <>
> > >
> >
> > Will it matter if somewhere in gluster the full heal command was run
> other
> > day?  Not sure if it eventually stops or times out.
> >
> Full heal will stop once the crawl is done. So if you want to trigger heal
> again, run gluster v heal <>. Actually, even bringing the brick up or volume
> start force should trigger the heal.
>

So am I stuck until the initial crawl started by the earlier heal full (which
was barely moving) finishes?  Does a volume restart or killing a certain
process release it?  Turning off the self-heal daemon or something?

> >
> > > > 1st node worked as expected, took 12 hours to heal 1TB of data.  Load
> > > > was a little heavy but nothing shocking.
> > > >
> > > > About an hour after node 1 finished, I began the same process on node2.
> > > > The heal 

Re: [Gluster-users] 3.8.3 Shards Healing Glacier Slow

2016-08-29 Thread Anuradha Talur


- Original Message -
> From: "David Gossage" 
> To: "Anuradha Talur" 
> Cc: "gluster-users@gluster.org List" , "Krutika 
> Dhananjay" 
> Sent: Monday, August 29, 2016 5:12:42 PM
> Subject: Re: [Gluster-users] 3.8.3 Shards Healing Glacier Slow
> 
> On Mon, Aug 29, 2016 at 5:39 AM, Anuradha Talur  wrote:
> 
> > Response inline.
> >
> > - Original Message -
> > > From: "Krutika Dhananjay" 
> > > To: "David Gossage" 
> > > Cc: "gluster-users@gluster.org List" 
> > > Sent: Monday, August 29, 2016 3:55:04 PM
> > > Subject: Re: [Gluster-users] 3.8.3 Shards Healing Glacier Slow
> > >
> > > Could you attach both client and brick logs? Meanwhile I will try these
> > steps
> > > out on my machines and see if it is easily recreatable.
> > >
> > > -Krutika
> > >
> > > On Mon, Aug 29, 2016 at 2:31 PM, David Gossage <
> > dgoss...@carouselchecks.com
> > > > wrote:
> > >
> > >
> > >
> > > Centos 7 Gluster 3.8.3
> > >
> > > Brick1: ccgl1.gl.local:/gluster1/BRICK1/1
> > > Brick2: ccgl2.gl.local:/gluster1/BRICK1/1
> > > Brick3: ccgl4.gl.local:/gluster1/BRICK1/1
> > > Options Reconfigured:
> > > cluster.data-self-heal-algorithm: full
> > > cluster.self-heal-daemon: on
> > > cluster.locking-scheme: granular
> > > features.shard-block-size: 64MB
> > > features.shard: on
> > > performance.readdir-ahead: on
> > > storage.owner-uid: 36
> > > storage.owner-gid: 36
> > > performance.quick-read: off
> > > performance.read-ahead: off
> > > performance.io-cache: off
> > > performance.stat-prefetch: on
> > > cluster.eager-lock: enable
> > > network.remote-dio: enable
> > > cluster.quorum-type: auto
> > > cluster.server-quorum-type: server
> > > server.allow-insecure: on
> > > cluster.self-heal-window-size: 1024
> > > cluster.background-self-heal-count: 16
> > > performance.strict-write-ordering: off
> > > nfs.disable: on
> > > nfs.addr-namelookup: off
> > > nfs.enable-ino32: off
> > > cluster.granular-entry-heal: on
> > >
> > > Friday did rolling upgrade from 3.8.3->3.8.3 no issues.
> > > Following the steps detailed in previous recommendations, I began the
> > > process of replacing and healing bricks one node at a time.
> > >
> > > 1) kill pid of brick
> > > 2) reconfigure brick from raid6 to raid10
> > > 3) recreate directory of brick
> > > 4) gluster volume start <> force
> > > 5) gluster volume heal <> full
> > Hi,
> >
> > I'd suggest that full heal is not used. There are a few bugs in full heal.
> > Better safe than sorry ;)
> > Instead I'd suggest the following steps:
> >
> > Currently I brought the node down by systemctl stop glusterd as I was
> getting sporadic io issues and a few VM's paused so hoping that will help.
> I may wait to do this till around 4PM when most work is done in case it
> shoots load up.
> 
> 
> > 1) kill pid of brick
> > 2) do whatever configuring of the brick you need
> > 3) recreate brick dir
> > 4) while the brick is still down, from the mount point:
> >a) create a dummy non existent dir under / of mount.
> >
> 
> so if node 2 is the down brick, do I pick another node, for example 3, and
> make a test dir under its brick directory that doesn't exist on 2, or
> should I be doing this over a gluster mount?
You should be doing this over gluster mount.
> 
> >b) set a non existent extended attribute on / of mount.
> >
> 
> Could you give me an example of an attribute to set?   I've read a tad on
> this, and looked up attributes but haven't set any yet myself.
> 
Sure. setfattr -n "user.some-name" -v "some-value" 
> Doing these steps will ensure that heal happens only from updated brick to
> > down brick.
> > 5) gluster v start <> force
> > 6) gluster v heal <>
> >
> 
> Will it matter if somewhere in gluster the full heal command was run other
> day?  Not sure if it eventually stops or times out.
> 
Full heal will stop once the crawl is done. So if you want to trigger heal
again, run gluster v heal <>. Actually, even bringing the brick up or volume
start force should trigger the heal.
> >
> > > 1st node worked as expected, took 12 hours to heal 1TB of data.  Load was
> > > a little heavy but nothing shocking.
> > >
> > > About an hour after node 1 finished, I began the same process on node2.
> > > The heal process kicked in as before and the files in directories visible
> > > from the mount and .glusterfs healed in a short time.  Then it began the
> > > crawl of .shard, adding those files to the heal count, at which point the
> > > entire process basically ground to a halt.  After 48 hours, out of 19k
> > > shards it has added 5900 to the heal list.  Load on all 3 machines is
> > > negligible.  It was suggested to change cluster.data-self-heal-algorithm
> > > to full and restart the volume, which I did.  No effect.  Tried
> > > relaunching the heal, no effect, regardless of which node I picked.  I
> > > started each VM and performed a stat of all files from within it, or a
> > > full 

Re: [Gluster-users] 3.8.3 Shards Healing Glacier Slow

2016-08-29 Thread David Gossage
On Mon, Aug 29, 2016 at 5:39 AM, Anuradha Talur  wrote:

> Response inline.
>
> - Original Message -
> > From: "Krutika Dhananjay" 
> > To: "David Gossage" 
> > Cc: "gluster-users@gluster.org List" 
> > Sent: Monday, August 29, 2016 3:55:04 PM
> > Subject: Re: [Gluster-users] 3.8.3 Shards Healing Glacier Slow
> >
> > Could you attach both client and brick logs? Meanwhile I will try these
> steps
> > out on my machines and see if it is easily recreatable.
> >
> > -Krutika
> >
> > On Mon, Aug 29, 2016 at 2:31 PM, David Gossage <
> dgoss...@carouselchecks.com
> > > wrote:
> >
> >
> >
> > Centos 7 Gluster 3.8.3
> >
> > Brick1: ccgl1.gl.local:/gluster1/BRICK1/1
> > Brick2: ccgl2.gl.local:/gluster1/BRICK1/1
> > Brick3: ccgl4.gl.local:/gluster1/BRICK1/1
> > Options Reconfigured:
> > cluster.data-self-heal-algorithm: full
> > cluster.self-heal-daemon: on
> > cluster.locking-scheme: granular
> > features.shard-block-size: 64MB
> > features.shard: on
> > performance.readdir-ahead: on
> > storage.owner-uid: 36
> > storage.owner-gid: 36
> > performance.quick-read: off
> > performance.read-ahead: off
> > performance.io-cache: off
> > performance.stat-prefetch: on
> > cluster.eager-lock: enable
> > network.remote-dio: enable
> > cluster.quorum-type: auto
> > cluster.server-quorum-type: server
> > server.allow-insecure: on
> > cluster.self-heal-window-size: 1024
> > cluster.background-self-heal-count: 16
> > performance.strict-write-ordering: off
> > nfs.disable: on
> > nfs.addr-namelookup: off
> > nfs.enable-ino32: off
> > cluster.granular-entry-heal: on
> >
> > Friday did rolling upgrade from 3.8.3->3.8.3 no issues.
> > Following the steps detailed in previous recommendations, I began the
> > process of replacing and healing bricks one node at a time.
> >
> > 1) kill pid of brick
> > 2) reconfigure brick from raid6 to raid10
> > 3) recreate directory of brick
> > 4) gluster volume start <> force
> > 5) gluster volume heal <> full
> Hi,
>
> I'd suggest that full heal is not used. There are a few bugs in full heal.
> Better safe than sorry ;)
> Instead I'd suggest the following steps:
>
> Currently I brought the node down by systemctl stop glusterd as I was
getting sporadic io issues and a few VM's paused so hoping that will help.
I may wait to do this till around 4PM when most work is done in case it
shoots load up.


> 1) kill pid of brick
> 2) do whatever configuring of the brick you need
> 3) recreate brick dir
> 4) while the brick is still down, from the mount point:
>a) create a dummy non existent dir under / of mount.
>

so if node 2 is the down brick, do I pick another node, for example 3, and
make a test dir under its brick directory that doesn't exist on 2, or
should I be doing this over a gluster mount?

>b) set a non existent extended attribute on / of mount.
>

Could you give me an example of an attribute to set?   I've read a tad on
this, and looked up attributes but haven't set any yet myself.

Doing these steps will ensure that heal happens only from updated brick to
> down brick.
> 5) gluster v start <> force
> 6) gluster v heal <>
>

Will it matter if somewhere in gluster the full heal command was run other
day?  Not sure if it eventually stops or times out.

>
> > 1st node worked as expected, took 12 hours to heal 1TB of data.  Load was a
> > little heavy but nothing shocking.
> >
> > About an hour after node 1 finished, I began the same process on node2.  The
> > heal process kicked in as before and the files in directories visible from
> > the mount and .glusterfs healed in a short time.  Then it began the crawl of
> > .shard, adding those files to the heal count, at which point the entire
> > process basically ground to a halt.  After 48 hours, out of 19k shards it
> > has added 5900 to the heal list.  Load on all 3 machines is negligible.  It
> > was suggested to change cluster.data-self-heal-algorithm to full and restart
> > the volume, which I did.  No effect.  Tried relaunching the heal, no effect,
> > regardless of which node I picked.  I started each VM and performed a stat
> > of all files from within it, or a full virus scan, and that seemed to cause
> > short small spikes in shards added, but not by much.  Logs are showing no
> > real messages indicating anything is going on.  I get hits in the brick log
> > on occasion for null lookups, making me think it's not really crawling the
> > shards directory but waiting for a shard lookup to add it.  I'll get the
> > following in the brick log, but not constantly, and sometimes multiple
> > entries for the same shard.
> >
> > [2016-08-29 08:31:57.478125] W [MSGID: 115009]
> > [server-resolve.c:569:server_resolve] 0-GLUSTER1-server: no resolution
> type
> > for (null) (LOOKUP)
> > [2016-08-29 08:31:57.478170] E [MSGID: 115050]
> > [server-rpc-fops.c:156:server_lookup_cbk] 0-GLUSTER1-server: 12591783:
> > LOOKUP (null) (---00
> > 00-/241a55ed-f0d5-4dbc-a6ce-ab784a0ba6ff.221) ==> (Invalid
> > argument) [Invalid argument]
> >
> > 

Re: [Gluster-users] 3.8.3 Shards Healing Glacier Slow

2016-08-29 Thread Anuradha Talur
Response inline.

- Original Message -
> From: "Krutika Dhananjay" 
> To: "David Gossage" 
> Cc: "gluster-users@gluster.org List" 
> Sent: Monday, August 29, 2016 3:55:04 PM
> Subject: Re: [Gluster-users] 3.8.3 Shards Healing Glacier Slow
> 
> Could you attach both client and brick logs? Meanwhile I will try these steps
> out on my machines and see if it is easily recreatable.
> 
> -Krutika
> 
> On Mon, Aug 29, 2016 at 2:31 PM, David Gossage < dgoss...@carouselchecks.com
> > wrote:
> 
> 
> 
> Centos 7 Gluster 3.8.3
> 
> Brick1: ccgl1.gl.local:/gluster1/BRICK1/1
> Brick2: ccgl2.gl.local:/gluster1/BRICK1/1
> Brick3: ccgl4.gl.local:/gluster1/BRICK1/1
> Options Reconfigured:
> cluster.data-self-heal-algorithm: full
> cluster.self-heal-daemon: on
> cluster.locking-scheme: granular
> features.shard-block-size: 64MB
> features.shard: on
> performance.readdir-ahead: on
> storage.owner-uid: 36
> storage.owner-gid: 36
> performance.quick-read: off
> performance.read-ahead: off
> performance.io-cache: off
> performance.stat-prefetch: on
> cluster.eager-lock: enable
> network.remote-dio: enable
> cluster.quorum-type: auto
> cluster.server-quorum-type: server
> server.allow-insecure: on
> cluster.self-heal-window-size: 1024
> cluster.background-self-heal-count: 16
> performance.strict-write-ordering: off
> nfs.disable: on
> nfs.addr-namelookup: off
> nfs.enable-ino32: off
> cluster.granular-entry-heal: on
> 
> Friday did rolling upgrade from 3.8.3->3.8.3 with no issues.
> Following the steps detailed in previous recommendations, I began the process
> of replacing and healing bricks one node at a time.
> 
> 1) kill pid of brick
> 2) reconfigure brick from raid6 to raid10
> 3) recreate directory of brick
> 4) gluster volume start <> force
> 5) gluster volume heal <> full
Hi,

I'd suggest that full heal is not used. There are a few bugs in full heal.
Better safe than sorry ;)
Instead I'd suggest the following steps:

1) kill pid of brick
2) do whatever configuring of the brick that you need
3) recreate brick dir
4) while the brick is still down, from the mount point:
   a) create a dummy non-existent dir under / of the mount.
   b) set a non-existent extended attribute on / of the mount.
Doing these steps will ensure that heal happens only from the updated brick to
the down brick.
5) gluster v start <> force
6) gluster v heal <>
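
As a minimal shell sketch of the above sequence, assuming the volume is named
GLUSTER1 (as in the brick logs), the replaced brick lives at
/gluster1/BRICK1/1, and the volume has a fuse mount at /mnt/gv (the mount
point and the dummy names below are placeholders):

  kill <brick-pid>                                  # 1) stop only this brick's process
  # 2) rebuild/reconfigure the underlying storage as needed, then:
  mkdir -p /gluster1/BRICK1/1                       # 3) recreate the empty brick directory
  # 4) with the brick still down, from the fuse mount of the volume:
  mkdir /mnt/gv/dummy-heal-marker-dir               # 4a) dummy dir that never existed before
  setfattr -n user.dummy-heal-marker -v 1 /mnt/gv   # 4b) throwaway xattr on / of the mount
  gluster volume start GLUSTER1 force               # 5) bring the recreated brick back online
  gluster volume heal GLUSTER1                      # 6) index heal (not "full")

Steps 4a and 4b just generate fresh entry and metadata changes on the
surviving bricks while the replaced one is down, so the self-heal daemon
treats the surviving copies as the source and the recreated brick purely as
the sink.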
> 
> 1st node worked as expected; it took 12 hours to heal 1TB of data. Load was a
> little heavy but nothing shocking.
> 
> About an hour after node 1 finished I began the same process on node 2. The
> heal process kicked in as before, and the files in directories visible from
> the mount and .glusterfs healed in short time. Then it began a crawl of
> .shard, adding those files to the heal count, at which point the entire
> process basically ground to a halt. After 48 hours, out of 19k shards it has
> added 5900 to the heal list. Load on all 3 machines is negligible. It was
> suggested to change cluster.data-self-heal-algorithm to full and restart the
> volume, which I did. No effect. Tried relaunching the heal, no effect,
> regardless of which node it was launched from. I started each VM and
> performed a stat of all files from within it, or a full virus scan, and that
> seemed to cause short small spikes in shards added, but not by much. Logs are
> showing no real messages indicating anything is going on. I get hits in the
> brick log on occasion of null lookups, making me think it is not really
> crawling the shards directory but waiting for a shard lookup to add it. I'll
> get the following in the brick log, but not constantly, and sometimes
> multiple entries for the same shard.
> 
> [2016-08-29 08:31:57.478125] W [MSGID: 115009]
> [server-resolve.c:569:server_resolve] 0-GLUSTER1-server: no resolution type
> for (null) (LOOKUP)
> [2016-08-29 08:31:57.478170] E [MSGID: 115050]
> [server-rpc-fops.c:156:server_lookup_cbk] 0-GLUSTER1-server: 12591783:
> LOOKUP (null) (---00
> 00-/241a55ed-f0d5-4dbc-a6ce-ab784a0ba6ff.221) ==> (Invalid
> argument) [Invalid argument]
> 
> This one repeated about 30 times in a row, then nothing for 10 minutes, then one
> hit for one different shard by itself.
> 
> How can I determine if Heal is actually running? How can I kill it or force
> restart? Does the node I start it from determine which directory gets crawled to
> determine heals?
> 
> David Gossage
> Carousel Checks Inc. | System Administrator
> Office 708.613.2284
> 
> ___
> Gluster-users mailing list
> Gluster-users@gluster.org
> http://www.gluster.org/mailman/listinfo/gluster-users
> 
> 
> ___
> Gluster-users mailing list
> Gluster-users@gluster.org
> http://www.gluster.org/mailman/listinfo/gluster-users

-- 
Thanks,
Anuradha.
___
Gluster-users mailing list
Gluster-users@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-users


Re: [Gluster-users] Fuse memleaks, all versions → OOM-killer

2016-08-29 Thread Yannick Perret

Hello,

back after holidays. I didn't see any new replies after this last mail; I 
hope I didn't miss any mails (too many mails to parse…).


BTW it seems that my problem is very similar to this open bug: 
https://bugzilla.redhat.com/show_bug.cgi?id=1369364
-> memory usage always increasing for (here) read ops until all mem/swap is 
exhausted, using the fuse client.


Regards,
--
Y.

On 02/08/2016 at 19:15, Yannick Perret wrote:
In order to prevent too much swap usage I removed swap on this machine 
(swapoff -a).

Memory usage was still growing.
After that I started another program that takes memory (in order to 
accelerate things) and I got the OOM-killer.


Here is the syslog:
[1246854.291996] Out of memory: Kill process 931 (glusterfs) score 742 
or sacrifice child
[1246854.292102] Killed process 931 (glusterfs) total-vm:3527624kB, 
anon-rss:3100328kB, file-rss:0kB


Last VSZ/RSS was: 3527624 / 3097096


Here is the rest of the OOM-killer data:
[1246854.291847] active_anon:600785 inactive_anon:377188 isolated_anon:0
 active_file:97 inactive_file:137 isolated_file:0
 unevictable:0 dirty:0 writeback:1 unstable:0
 free:21740 slab_reclaimable:3309 slab_unreclaimable:3728
 mapped:255 shmem:4267 pagetables:3286 bounce:0
 free_cma:0
[1246854.291851] Node 0 DMA free:15876kB min:264kB low:328kB 
high:396kB active_anon:0kB inactive_anon:0kB active_file:0kB 
inactive_file:0kB unevictable:0kB isolated(anon):0kB 
isolated(file):0kB present:15992kB managed:15908kB mlocked:0kB 
dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB 
slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB unstable:0kB 
bounce:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 
all_unreclaimable? yes

[1246854.291858] lowmem_reserve[]: 0 2980 3948 3948
[1246854.291861] Node 0 DMA32 free:54616kB min:50828kB low:63532kB 
high:76240kB active_anon:1940432kB inactive_anon:1020924kB 
active_file:248kB inactive_file:260kB unevictable:0kB 
isolated(anon):0kB isolated(file):0kB present:3129280kB 
managed:3054836kB mlocked:0kB dirty:0kB writeback:0kB mapped:760kB 
shmem:14616kB slab_reclaimable:9660kB slab_unreclaimable:8244kB 
kernel_stack:1456kB pagetables:10056kB unstable:0kB bounce:0kB 
free_cma:0kB writeback_tmp:0kB pages_scanned:803 all_unreclaimable? yes

[1246854.291865] lowmem_reserve[]: 0 0 967 967
[1246854.291867] Node 0 Normal free:16468kB min:16488kB low:20608kB 
high:24732kB active_anon:462708kB inactive_anon:487828kB 
active_file:140kB inactive_file:288kB unevictable:0kB 
isolated(anon):0kB isolated(file):0kB present:1048576kB 
managed:990356kB mlocked:0kB dirty:0kB writeback:4kB mapped:260kB 
shmem:2452kB slab_reclaimable:3576kB slab_unreclaimable:6668kB 
kernel_stack:560kB pagetables:3088kB unstable:0kB bounce:0kB 
free_cma:0kB writeback_tmp:0kB pages_scanned:975 all_unreclaimable? yes

[1246854.291872] lowmem_reserve[]: 0 0 0 0
[1246854.291874] Node 0 DMA: 1*4kB (U) 0*8kB 0*16kB 2*32kB (U) 3*64kB 
(U) 0*128kB 1*256kB (U) 0*512kB 1*1024kB (U) 1*2048kB (R) 3*4096kB 
(EM) = 15876kB
[1246854.291882] Node 0 DMA32: 1218*4kB (UEM) 848*8kB (UE) 621*16kB 
(UE) 314*32kB (UEM) 189*64kB (UEM) 49*128kB (UEM) 2*256kB (E) 0*512kB 
0*1024kB 0*2048kB 1*4096kB (R) = 54616kB
[1246854.291891] Node 0 Normal: 3117*4kB (UE) 0*8kB 0*16kB 3*32kB (R) 
1*64kB (R) 2*128kB (R) 0*256kB 1*512kB (R) 1*1024kB (R) 1*2048kB (R) 
0*4096kB = 16468kB
[1246854.291900] Node 0 hugepages_total=0 hugepages_free=0 
hugepages_surp=0 hugepages_size=2048kB

[1246854.291902] 4533 total pagecache pages
[1246854.291903] 0 pages in swap cache
[1246854.291905] Swap cache stats: add 343501, delete 343501, find 
7730690/7732743

[1246854.291906] Free swap  = 0kB
[1246854.291907] Total swap = 0kB
[1246854.291908] 1048462 pages RAM
[1246854.291909] 0 pages HighMem/MovableOnly
[1246854.291909] 14555 pages reserved
[1246854.291910] 0 pages hwpoisoned

Regards,
--
Y.



On 02/08/2016 at 17:00, Yannick Perret wrote:

So here are the dumps, gzip'ed.

What I did:
1. mounting the volume, removing all its content, unmounting it
2. mounting the volume
3. performing a cp -Rp /usr/* /root/MNT
4. performing a rm -rf /root/MNT/*
5. taking a dump (glusterdump.p1.dump)
6. re-doing 3, 4 and 5 (glusterdump.p2.dump)
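
The dumps referred to in steps 5 and 6 are glusterfs statedumps; one way to
trigger them is to send SIGUSR1 to the fuse client process, with the dump
file typically written under /var/run/gluster. A minimal sketch, assuming a
single glusterfs client process on the box:

  kill -USR1 $(pidof glusterfs)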

VSZ/RSS are respectively:
- 381896 / 35688 just after mount
- 644040 / 309240 after 1st cp -Rp
- 644040 / 310128 after 1st rm -rf
- 709576 / 310128 after 1st kill -USR1
- 840648 / 421964 after 2nd cp -Rp
- 840648 / 44 after 2nd rm -rf

I created a small script that performs these actions in an infinite loop:
while /bin/true
do
  cp -Rp /usr/* /root/MNT/
  ps -o vsz=,rss= -p "$(pidof glusterfs)"   # get VSZ/RSS of glusterfs process
  rm -rf /root/MNT/*
  ps -o vsz=,rss= -p "$(pidof glusterfs)"   # get VSZ/RSS of glusterfs process
done

At this time here are the values so far:
971720 533988
1037256 645500
1037256 645840
1168328 757348
1168328 757620
1299400 869128
1299400 869328
1364936 980712
1364936 980944
1496008 1092384
1496008 1092404
1627080 1203796
1627080 1203996
1692616 1315572
1692616 1315504
1823688 1426812
1823688 1427340
1954760 1538716
1954760 

Re: [Gluster-users] 3.8.3 Shards Healing Glacier Slow

2016-08-29 Thread Krutika Dhananjay
Could you attach both client and brick logs? Meanwhile I will try these
steps out on my machines and see if it is easily recreatable.

-Krutika

On Mon, Aug 29, 2016 at 2:31 PM, David Gossage 
wrote:

> Centos 7 Gluster 3.8.3
>
> Brick1: ccgl1.gl.local:/gluster1/BRICK1/1
> Brick2: ccgl2.gl.local:/gluster1/BRICK1/1
> Brick3: ccgl4.gl.local:/gluster1/BRICK1/1
> Options Reconfigured:
> cluster.data-self-heal-algorithm: full
> cluster.self-heal-daemon: on
> cluster.locking-scheme: granular
> features.shard-block-size: 64MB
> features.shard: on
> performance.readdir-ahead: on
> storage.owner-uid: 36
> storage.owner-gid: 36
> performance.quick-read: off
> performance.read-ahead: off
> performance.io-cache: off
> performance.stat-prefetch: on
> cluster.eager-lock: enable
> network.remote-dio: enable
> cluster.quorum-type: auto
> cluster.server-quorum-type: server
> server.allow-insecure: on
> cluster.self-heal-window-size: 1024
> cluster.background-self-heal-count: 16
> performance.strict-write-ordering: off
> nfs.disable: on
> nfs.addr-namelookup: off
> nfs.enable-ino32: off
> cluster.granular-entry-heal: on
>
> Friday did rolling upgrade from 3.8.3->3.8.3 with no issues.
> Following the steps detailed in previous recommendations, I began the process
> of replacing and healing bricks one node at a time.
>
> 1) kill pid of brick
> 2) reconfigure brick from raid6 to raid10
> 3) recreate directory of brick
> 4) gluster volume start <> force
> 5) gluster volume heal <> full
>
> 1st node worked as expected; it took 12 hours to heal 1TB of data. Load was a
> little heavy but nothing shocking.
>
> About an hour after node 1 finished I began the same process on node 2. The
> heal process kicked in as before, and the files in directories visible from
> the mount and .glusterfs healed in short time. Then it began a crawl of
> .shard, adding those files to the heal count, at which point the entire
> process basically ground to a halt. After 48 hours, out of 19k shards it has
> added 5900 to the heal list. Load on all 3 machines is negligible. It was
> suggested to change cluster.data-self-heal-algorithm to full and restart the
> volume, which I did. No effect. Tried relaunching the heal, no effect,
> regardless of which node it was launched from. I started each VM and
> performed a stat of all files from within it, or a full virus scan, and that
> seemed to cause short small spikes in shards added, but not by much. Logs are
> showing no real messages indicating anything is going on. I get hits in the
> brick log on occasion of null lookups, making me think it is not really
> crawling the shards directory but waiting for a shard lookup to add it. I'll
> get the following in the brick log, but not constantly, and sometimes
> multiple entries for the same shard.
>
> [2016-08-29 08:31:57.478125] W [MSGID: 115009]
> [server-resolve.c:569:server_resolve] 0-GLUSTER1-server: no resolution
> type for (null) (LOOKUP)
> [2016-08-29 08:31:57.478170] E [MSGID: 115050]
> [server-rpc-fops.c:156:server_lookup_cbk] 0-GLUSTER1-server: 12591783:
> LOOKUP (null) (---00
> 00-/241a55ed-f0d5-4dbc-a6ce-ab784a0ba6ff.221) ==> (Invalid
> argument) [Invalid argument]
>
> This one repeated about 30 times in a row, then nothing for 10 minutes, then
> one hit for one different shard by itself.
>
> How can I determine if Heal is actually running?  How can I kill it or
> force restart?  Does the node I start it from determine which directory gets
> crawled to determine heals?
>
> *David Gossage*
> *Carousel Checks Inc. | System Administrator*
> *Office* 708.613.2284
>
> ___
> Gluster-users mailing list
> Gluster-users@gluster.org
> http://www.gluster.org/mailman/listinfo/gluster-users
>
___
Gluster-users mailing list
Gluster-users@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-users

[Gluster-users] 3.8.3 Shards Healing Glacier Slow

2016-08-29 Thread David Gossage
Centos 7 Gluster 3.8.3

Brick1: ccgl1.gl.local:/gluster1/BRICK1/1
Brick2: ccgl2.gl.local:/gluster1/BRICK1/1
Brick3: ccgl4.gl.local:/gluster1/BRICK1/1
Options Reconfigured:
cluster.data-self-heal-algorithm: full
cluster.self-heal-daemon: on
cluster.locking-scheme: granular
features.shard-block-size: 64MB
features.shard: on
performance.readdir-ahead: on
storage.owner-uid: 36
storage.owner-gid: 36
performance.quick-read: off
performance.read-ahead: off
performance.io-cache: off
performance.stat-prefetch: on
cluster.eager-lock: enable
network.remote-dio: enable
cluster.quorum-type: auto
cluster.server-quorum-type: server
server.allow-insecure: on
cluster.self-heal-window-size: 1024
cluster.background-self-heal-count: 16
performance.strict-write-ordering: off
nfs.disable: on
nfs.addr-namelookup: off
nfs.enable-ino32: off
cluster.granular-entry-heal: on

Friday did rolling upgrade from 3.8.3->3.8.3 with no issues.
Following the steps detailed in previous recommendations, I began the process of
replacing and healing bricks one node at a time.

1) kill pid of brick
2) reconfigure brick from raid6 to raid10
3) recreate directory of brick
4) gluster volume start <> force
5) gluster volume heal <> full

1st node worked as expected; it took 12 hours to heal 1TB of data. Load was a
little heavy but nothing shocking.

About an hour after node 1 finished I began the same process on node 2. The
heal process kicked in as before, and the files in directories visible from the
mount and .glusterfs healed in short time. Then it began a crawl of .shard,
adding those files to the heal count, at which point the entire process
basically ground to a halt. After 48 hours, out of 19k shards it has added 5900
to the heal list. Load on all 3 machines is negligible. It was suggested to
change cluster.data-self-heal-algorithm to full and restart the volume, which I
did. No effect. Tried relaunching the heal, no effect, regardless of which node
it was launched from. I started each VM and performed a stat of all files from
within it, or a full virus scan, and that seemed to cause short small spikes in
shards added, but not by much. Logs are showing no real messages indicating
anything is going on. I get hits in the brick log on occasion of null lookups,
making me think it is not really crawling the shards directory but waiting for
a shard lookup to add it. I'll get the following in the brick log, but not
constantly, and sometimes multiple entries for the same shard.

[2016-08-29 08:31:57.478125] W [MSGID: 115009]
[server-resolve.c:569:server_resolve] 0-GLUSTER1-server: no resolution type
for (null) (LOOKUP)
[2016-08-29 08:31:57.478170] E [MSGID: 115050]
[server-rpc-fops.c:156:server_lookup_cbk] 0-GLUSTER1-server: 12591783:
LOOKUP (null) (---00
00-/241a55ed-f0d5-4dbc-a6ce-ab784a0ba6ff.221) ==> (Invalid
argument) [Invalid argument]

This one repeated about 30 times in a row, then nothing for 10 minutes, then
one hit for one different shard by itself.

How can I determine if Heal is actually running?  How can I kill it or
force restart?  Does the node I start it from determine which directory gets
crawled to determine heals?
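
A few CLI ways to check heal activity, as a sketch (<VOLNAME> is a placeholder
for the volume name):

  gluster volume heal <VOLNAME> info                    # entries currently pending heal, per brick
  gluster volume heal <VOLNAME> statistics heal-count   # just the pending counts
  gluster volume heal <VOLNAME> statistics              # history of self-heal daemon crawls
  gluster volume heal <VOLNAME>                         # (re)trigger an index heal

One common way to restart the self-heal daemon itself is "gluster volume start
<VOLNAME> force", which respawns any volume processes that are not running.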

*David Gossage*
*Carousel Checks Inc. | System Administrator*
*Office* 708.613.2284
___
Gluster-users mailing list
Gluster-users@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-users

Re: [Gluster-users] So what are people using for 10G nics

2016-08-29 Thread Ivan Rossi
X540-t2 now, but in the past we used Solarflare with no particular issues.

On 26/Aug/2016 22:32, "Diego Remolina"  wrote:

> Servers now also come with copper 10Gbit network adapters built into the
> motherboard (Dell R730, Supermicro, etc.). But for those that do not, I have
> used the Intel X540-T2 adapters with CentOS 7 and RHEL 7.
>
> As for switches, our infrastructure uses expensive Cisco 9XXX series and
> FEX expanders, so I cannot really say much about "inexpensive" ones.
>
> Diego
>
> On Aug 26, 2016 16:05, "WK"  wrote:
>
>> Prices seem to be dropping online at NewEgg etc and going from 2 nodes to
>> 3 nodes for a quorum implies a lot more traffic than would be comfortable
>> with 1G.
>>
>> Any NIC/Switch recommendations for RH/Cent 7.x and Ubuntu 16?
>>
>>
>> -wk
>>
>> ___
>> Gluster-users mailing list
>> Gluster-users@gluster.org
>> http://www.gluster.org/mailman/listinfo/gluster-users
>>
>
> ___
> Gluster-users mailing list
> Gluster-users@gluster.org
> http://www.gluster.org/mailman/listinfo/gluster-users
>
___
Gluster-users mailing list
Gluster-users@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-users

Re: [Gluster-users] [Gluster-devel] CFP for Gluster Developer Summit

2016-08-29 Thread Raghavendra G
Though it's a bit late, here is one from me:

Topic: "DHT: current design, (dis)advantages, challenges - A perspective"

Agenda:

I'll try to address
* the whys and (dis)advantages of the current design. As noted in the title,
this is my own perspective, gathered while working on DHT. We don't have any
existing documentation for the motivations; the sources have been bugs (a huge
number of them :)), interaction with other people working on DHT, and code
reading.
* Current work going on and a rough roadmap of what we'll be working on
during at least the next few months.
* Going by the objectives of this talk, this may well turn out to be a
discussion.

regards,

On Wed, Aug 24, 2016 at 8:27 PM, Arthy Loganathan 
wrote:

>
>
> On 08/24/2016 07:18 PM, Atin Mukherjee wrote:
>
>
>
> On Wed, Aug 24, 2016 at 5:43 PM, Arthy Loganathan < 
> aloga...@redhat.com> wrote:
>
>> Hi,
>>
>> I would like to propose below topic as a lightening talk.
>>
>> Title: Data Logging to monitor Gluster Performance
>>
>> Theme: Process and Infrastructure
>>
>> To benchmark any software product, we often need to do performance
>> analysis of the system along with the product. I have written a tool,
>> "System Monitor", to periodically collect data such as CPU usage, memory
>> usage and load average (with graphical representation) for any process on
>> a system. The collected data can help in analyzing system and product
>> performance.
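
A hypothetical sketch of this kind of periodic sampling (not the actual
system_monitor tool; the process name, interval and output file below are
made up):

  #!/bin/sh
  # Sample CPU%, RSS (kB) and 1-minute load average of a named process
  # every 60 seconds and append the values to a CSV file.
  PROC_NAME="${1:-glusterfsd}"   # process to watch; glusterfsd is only an example
  OUT="samples.csv"
  echo "timestamp,cpu_pct,rss_kb,load1" > "$OUT"
  while true
  do
      PID=$(pidof -s "$PROC_NAME") || { echo "no such process: $PROC_NAME" >&2; exit 1; }
      STATS=$(ps -o %cpu=,rss= -p "$PID" | awk '{print $1","$2}')
      LOAD1=$(cut -d' ' -f1 /proc/loadavg)
      echo "$(date +%s),$STATS,$LOAD1" >> "$OUT"
      sleep 60
  done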
>>
>
> A link to this project would definitely help here.
>
>
> Hi Atin,
>
> Here is the link to the project - https://github.com/aloganat/
> system_monitor
>
> Thanks & Regards,
> Arthy
>
>
>
>>
>> From this talk I would like to give an overview of this tool and explain
>> how it can be used to monitor Gluster performance.
>>
>> Agenda:
>>   - Overview of the tool and its usage
>>   - Collecting the data in an Excel sheet at regular intervals of time
>>   - Plotting the graph with that data (in progress)
>>   - a short demo
>>
>> Thanks & Regards,
>>
>> Arthy
>>
>>
>>
>>
>> ___
>> Gluster-users mailing list
>> Gluster-users@gluster.org
>> http://www.gluster.org/mailman/listinfo/gluster-users
>>
>
>
>
> --
>
> --Atin
>
>
>
> ___
> Gluster-devel mailing list
> gluster-de...@gluster.org
> http://www.gluster.org/mailman/listinfo/gluster-devel
>



-- 
Raghavendra G
___
Gluster-users mailing list
Gluster-users@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-users