Re: [Gluster-users] Help analise statedumps

2019-02-05 Thread Pedro Costa
Hi Sanju,

The process was `glusterfs`, yes; I took the statedump for the same process
(different PID, since it was rebooted).
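
For reference, a statedump of the FUSE client process is taken roughly like
this (a sketch; the default dump directory /var/run/gluster and the pgrep
pattern are assumptions):

# find the glusterfs FUSE client for gvol1
pgrep -af 'glusterfs.*gvol1'
# SIGUSR1 asks the process for a statedump; a glusterdump.<pid>.dump.<timestamp>
# file is written to /var/run/gluster by default
kill -USR1 <PID>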

Cheers,
P.

From: Sanju Rakonde 
Sent: 04 February 2019 06:10
To: Pedro Costa 
Cc: gluster-users 
Subject: Re: [Gluster-users] Help analise statedumps

Hi,

Can you please specify which process has the leak? Have you taken the statedump
of the same process that has the leak?

Thanks,
Sanju

On Sat, Feb 2, 2019 at 3:15 PM Pedro Costa <pedro@pmc.digital> wrote:
Hi,

I have a 3x replicated cluster running 4.1.7 on Ubuntu 16.04.5; all 3 replicas
are also clients, hosting a Node.js/Nginx web server.

The current configuration is as such:

Volume Name: gvol1
Type: Replicate
Volume ID: XX
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x 3 = 3
Transport-type: tcp
Bricks:
Brick1: vm00:/srv/brick1/gvol1
Brick2: vm01:/srv/brick1/gvol1
Brick3: vm02:/srv/brick1/gvol1
Options Reconfigured:
cluster.self-heal-readdir-size: 2KB
cluster.self-heal-window-size: 2
cluster.background-self-heal-count: 20
network.ping-timeout: 5
disperse.eager-lock: off
performance.parallel-readdir: on
performance.readdir-ahead: on
performance.rda-cache-limit: 128MB
performance.cache-refresh-timeout: 10
performance.nl-cache-timeout: 600
performance.nl-cache: on
cluster.nufa: on
performance.enable-least-priority: off
server.outstanding-rpc-limit: 128
performance.strict-o-direct: on
cluster.shd-max-threads: 12
client.event-threads: 4
cluster.lookup-optimize: on
network.inode-lru-limit: 9
performance.md-cache-timeout: 600
performance.cache-invalidation: on
performance.cache-samba-metadata: on
performance.stat-prefetch: on
features.cache-invalidation-timeout: 600
features.cache-invalidation: on
storage.fips-mode-rchecksum: on
transport.address-family: inet
nfs.disable: on
performance.client-io-threads: on
features.utime: on
storage.ctime: on
server.event-threads: 4
performance.cache-size: 256MB
performance.read-ahead: on
cluster.readdir-optimize: on
cluster.strict-readdir: on
performance.io-thread-count: 8
server.allow-insecure: on
cluster.read-hash-mode: 0
cluster.lookup-unhashed: auto
cluster.choose-local: on
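
For reference, options like these are applied and read back with
gluster volume set / gluster volume get, e.g. for one of the values above:

gluster volume set gvol1 network.ping-timeout 5
gluster volume get gvol1 network.ping-timeout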

I believe there’s a memory leak somewhere: usage just keeps going up until it
hangs one or more nodes, sometimes taking the whole cluster down.

I have taken 2 statedumps on one of the nodes, one where the memory is too high 
and another just after a reboot with the app running and the volume fully 
healed.

https://pmcdigital.sharepoint.com/:u:/g/EYDsNqTf1UdEuE6B0ZNVPfIBf_I-AbaqHotB1lJOnxLlTg?e=boYP09 (high memory)

https://pmcdigital.sharepoint.com/:u:/g/EWZBsnET2xBHl6OxO52RCfIBvQ0uIDQ1GKJZ1GrnviyMhg?e=wI3yaY (after reboot)
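
To compare the two dumps, a crude first pass is to line each memory-accounting
section header up with its size= value and sort, largest first (a sketch; the
filenames are placeholders for the two uploads above):

# every "[... memusage]" section in a statedump is followed by size=/num_allocs= lines
grep -E 'memusage\]|^size=' glusterdump.high.dump  | paste - - | sort -t= -k2 -rn | head -20
grep -E 'memusage\]|^size=' glusterdump.fresh.dump | paste - - | sort -t= -k2 -rn | head -20

An allocator whose size keeps growing between the two dumps is a good candidate
for the leak.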

Any help would be greatly appreciated,

Kindest Regards,

Pedro Maia Costa
Senior Developer, pmc.digital


--
Thanks,
Sanju
___
Gluster-users mailing list
Gluster-users@gluster.org
https://lists.gluster.org/mailman/listinfo/gluster-users

Re: [Gluster-users] Corrupted File readable via FUSE?

2019-02-05 Thread FNU Raghavendra Manjunath
Hi David,

Do you have any bricks down? Can you please share the output of the
following commands and also the logs of the server and the client nodes?

1) gluster volume info
2) gluster volume status
3) gluster volume bitrot <volname> scrub status
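
For example, with the volume name from the client log below (a sketch; adjust
the name as needed):

gluster volume bitrot archive1 scrub ondemand
gluster volume bitrot archive1 scrub status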

A few more questions:

1) How many copies of the file were corrupted? (All? Or just one?)

Two things I am trying to understand:

A) IIUC, if only one copy is corrupted, then the replication module in
the gluster client should serve the data from the
remaining good copy.
B) If all the copies are corrupted (or more than a quorum of copies are
corrupted, which means 2 in the case of 3-way replication),
then an error is returned to the application. But the error reported should be
'Input/output error', not 'Transport endpoint is not connected'.
   'Transport endpoint is not connected' usually comes when the brick
the operation is being directed to is not connected to the client.
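
A quick way to check that from one of the server nodes (a sketch, again
assuming the volume is named archive1 as in your log):

# lists which clients are connected to each brick
gluster volume status archive1 clients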



Regards,
Raghavendra

On Mon, Feb 4, 2019 at 6:02 AM David Spisla  wrote:

> Hello Amar,
> Sounds good. So far this patch has only been merged into master. I think it
> should be part of the next v5.x patch release!
>
> Regards
> David
>
> On Mon, 4 Feb 2019 at 09:58, Amar Tumballi Suryanarayan <atumb...@redhat.com> wrote:
>
>> Hi David,
>>
>> I guess https://review.gluster.org/#/c/glusterfs/+/21996/ helps to fix
>> the issue. I will leave it to Raghavendra Bhat to reconfirm.
>>
>> Regards,
>> Amar
>>
>> On Fri, Feb 1, 2019 at 8:45 PM David Spisla  wrote:
>>
>>> Hello Gluster Community,
>>> I have a 4-node cluster with a replica 4 volume, so each node has a
>>> brick with a copy of the file. Now I tried out the bitrot functionality and
>>> corrupted the copy on the brick of node1. After this I ran an on-demand scrub
>>> and the file is correctly marked as corrupted.
>>>
>>> Now I try to read that file via FUSE on node1 (with the corrupt copy):
>>> $ cat file1.txt
>>> cat: file1.txt: Transport endpoint is not connected
>>> FUSE log says:
>>>
>>> *[2019-02-01 15:02:19.191984] E [MSGID: 114031]
>>> [client-rpc-fops_v2.c:281:client4_0_open_cbk] 0-archive1-client-0: remote
>>> operation failed. Path: /data/file1.txt
>>> (b432c1d6-ece2-42f2-8749-b11e058c4be3) [Input/output error]*
>>> [2019-02-01 15:02:19.192269] W [dict.c:761:dict_ref]
>>> (-->/usr/lib64/glusterfs/5.3/xlator/performance/quick-read.so(+0x7329)
>>> [0x7fc642471329]
>>> -->/usr/lib64/glusterfs/5.3/xlator/performance/io-cache.so(+0xaaf5)
>>> [0x7fc642682af5] -->/usr/lib64/libglusterfs.so.0(dict_ref+0x58)
>>> [0x7fc64a78d218] ) 0-dict: dict is NULL [Invalid argument]
>>> [2019-02-01 15:02:19.192714] E [MSGID: 108009]
>>> [afr-open.c:220:afr_openfd_fix_open_cbk] 0-archive1-replicate-0: Failed to
>>> open /data/file1.txt on subvolume archive1-client-0 [Input/output error]
>>> *[2019-02-01 15:02:19.193009] W [fuse-bridge.c:2371:fuse_readv_cbk]
>>> 0-glusterfs-fuse: 147733: READ => -1
>>> gfid=b432c1d6-ece2-42f2-8749-b11e058c4be3 fd=0x7fc60408bbb8 (Transport
>>> endpoint is not connected)*
>>> [2019-02-01 15:02:19.193653] W [MSGID: 114028]
>>> [client-lk.c:347:delete_granted_locks_owner] 0-archive1-client-0: fdctx not
>>> valid [Invalid argument]
>>>
>>> And via FUSE on node2 (with a healthy copy):
>>> $ cat file1.txt
>>> file1
>>>
>>> It seems that node1 wants to read the file from its own brick, but
>>> the copy there is broken. Node2 reads the file from its own brick, which has
>>> a healthy copy, so reading the file succeeds.
>>> But I am wondering because sometimes reading the file from node1
>>> with the broken copy succeeds.
>>>
>>> What is the expected behaviour here? Is it possible to read files with a
>>> corrupted copy from any client?
>>>
>>> Regards
>>> David Spisla
>>>
>>>
>>
>>
>>
>> --
>> Amar Tumballi (amarts)
>>
>
___
Gluster-users mailing list
Gluster-users@gluster.org
https://lists.gluster.org/mailman/listinfo/gluster-users

Re: [Gluster-users] Getting timedout error while rebalancing

2019-02-05 Thread Nithya Balachandran
On Tue, 5 Feb 2019 at 17:26, deepu srinivasan  wrote:

> Hi Nithya,
> We have a test Gluster setup. We are testing the rebalancing option of
> Gluster. So we started the volume, which has 1x3 bricks, with some data on
> it.
> command : gluster volume create test-volume replica 3
> 192.168.xxx.xx1:/home/data/repl 192.168.xxx.xx2:/home/data/repl
> 192.168.xxx.xx3:/home/data/repl.
>
> Now we tried to expand the cluster storage by adding three more bricks.
> command : gluster volume add-brick test-volume 192.168.xxx.xx4:/home/data/repl
> 192.168.xxx.xx5:/home/data/repl 192.168.xxx.xx6:/home/data/repl
>
> So after the brick addition we tried to rebalance the layout and the data.
> command : gluster volume rebalance test-volume fix-layout start.
> The command exited with status "Error : Request timed out".
>

This sounds like an error in the cli or glusterd. Can you send the
glusterd.log from the node on which you ran the command?
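
On a default install that log is /var/log/glusterfs/glusterd.log; a quick way
to pull out the relevant part (paths here assume the default log locations):

grep -iE 'rebalance|timed out' /var/log/glusterfs/glusterd.log | tail -n 100
# the CLI side of the request is logged here as well
tail -n 50 /var/log/glusterfs/cli.log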

regards,
Nithya

>
> After the failure of the command, we tried to view its status, and it
> looks something like this:
>
> Node                 Rebalanced-files      size   scanned  failures  skipped     status   run time in h:m:s
> ----------------     ----------------  --------   -------  --------  -------   ---------  -----------------
> localhost                          41    41.0MB      8200         0        0   completed            0:00:09
> 192.168.xxx.xx4                    79    79.0MB      8231         0        0   completed            0:00:12
> 192.168.xxx.xx6                    58    58.0MB      8281         0        0   completed            0:00:10
> 192.168.xxx.xx2                   136   136.0MB      8566         0      136   completed            0:00:07
> 192.168.xxx.xx4                   129   129.0MB      8566         0      129   completed            0:00:07
> 192.168.xxx.xx6                   201   201.0MB      8566         0      201   completed            0:00:08
>
> Is the rebalancing option working fine? Why did gluster throw the error
> saying "Error : Request timed out"?
>
> On Tue, Feb 5, 2019 at 4:23 PM Nithya Balachandran  wrote:
>
>> Hi,
>> Please provide the exact step at which you are seeing the error. It would
>> be ideal if you could copy-paste the command and the error.
>>
>> Regards,
>> Nithya
>>
>>
>>
>> On Tue, 5 Feb 2019 at 15:24, deepu srinivasan  wrote:
>>
>>> Hi everyone. I am getting "Error : Request timed out" while doing a
>>> rebalance. I have added new bricks to my replicated volume, i.e. first it
>>> was a 1x3 volume and I added three more bricks to make it a
>>> distributed-replicated volume (2x3). What should I do about the timeout
>>> error?
>>
>>
___
Gluster-users mailing list
Gluster-users@gluster.org
https://lists.gluster.org/mailman/listinfo/gluster-users

Re: [Gluster-users] Getting timedout error while rebalancing

2019-02-05 Thread Nithya Balachandran
Hi,
Please provide the exact step at which you are seeing the error. It would
be ideal if you could copy-paste the command and the error.

Regards,
Nithya



On Tue, 5 Feb 2019 at 15:24, deepu srinivasan  wrote:

> Hi everyone. I am getting "Error : Request timed out" while doing a
> rebalance. I have added new bricks to my replicated volume, i.e. first it
> was a 1x3 volume and I added three more bricks to make it a
> distributed-replicated volume (2x3). What should I do about the timeout
> error?
___
Gluster-users mailing list
Gluster-users@gluster.org
https://lists.gluster.org/mailman/listinfo/gluster-users

[Gluster-users] Getting timedout error while rebalancing

2019-02-05 Thread deepu srinivasan
Hi everyone. I am getting "Error : Request timed out" while doing a
rebalance. I have added new bricks to my replicated volume, i.e. first it
was a 1x3 volume and I added three more bricks to make it a
distributed-replicated volume (2x3). What should I do about the timeout
error?
___
Gluster-users mailing list
Gluster-users@gluster.org
https://lists.gluster.org/mailman/listinfo/gluster-users