[Gluster-users] sshd raid 5 gluster 3.7.13

2016-07-30 Thread Ricky Venerayan
Have any of you used SSHD RAID 5 with Gluster 3.7.13? If not, I will be
using it and will post the outcome.
___
Gluster-users mailing list
Gluster-users@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-users

Re: [Gluster-users] anyone who have use sshd raid 5 with gluster 3.7.13?

2016-07-30 Thread Lindsay Mathieson

On 31/07/2016 10:46 AM, Lenovo Lastname wrote:
Has anyone used SSHD RAID 5 with Gluster 3.7.13 with sharding? If not,
I will let you know what happens. I will be using Seagate SSHD 1TB x3
with 32G NAND.


Do you mean the underlying brick is RAID 5? How many bricks and what
replication factor?



BTW the font in your emails is so small as to be illegible.

--
Lindsay Mathieson

___
Gluster-users mailing list
Gluster-users@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-users


[Gluster-users] anyone who have use sshd raid 5 with gluster 3.7.13?

2016-07-30 Thread Lenovo Lastname
Has anyone used SSHD RAID 5 with Gluster 3.7.13 with sharding? If not,
I will let you know what happens. I will be using Seagate SSHD 1TB x3
with 32G NAND.
___
Gluster-users mailing list
Gluster-users@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-users

[Gluster-users] managing slow drives in cluster

2016-07-30 Thread Jay Berkenbilt
We're using glusterfs in Amazon EC2 and observing certain behavior
involving EBS volumes. The basic situation is that, in some cases,
clients can write data to the file system at a rate such that the
gluster daemon on one or more of the nodes may block in disk wait for
longer than 42 seconds, causing gluster to decide that the brick is
down. In fact, it's not down, it's just slow. I believe it is possible,
by looking at certain system data on the node that hosts the drive, to
tell the difference between a brick that is down and one that is simply
working through its queue.

We are attempting a two-pronged approach to solving this problem:

1. We would like to figure out how to tune the system, by adjusting kernel
parameters, glusterd settings, or both, so that it never accumulates so much
data to flush out to disk that it blocks in disk wait for such a long time.
2. We would like to see if we can make gluster more intelligent about
responding to pings, so that the client side still gets a response when the
remote side is merely behind, not down. I do understand that in some
high-performance environments one may want to treat a disk that isn't
keeping up as failed, so this may have to be a tunable parameter (a rough
sketch of both kinds of tuning follows this list).
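
(For concreteness: the 42-second window above is GlusterFS's network.ping-timeout
volume option, and how much dirty data the kernel lets pile up before blocking
writers is governed by the vm.dirty_background_bytes / vm.dirty_bytes sysctls.
The Python sketch below shows the kind of knobs we mean; the byte limits, the
60-second timeout, and the volume name "myvol" are illustrative placeholders,
not recommendations.)

#!/usr/bin/env python3
# Sketch: cap dirty-page accumulation and relax the brick ping timeout.
# All numeric values and the volume name are placeholders for illustration.
import subprocess

def set_sysctl(name, value):
    # Equivalent to "sysctl -w <name>=<value>"; lasts until reboot.
    with open("/proc/sys/" + name.replace(".", "/"), "w") as f:
        f.write(str(value))

# Start background writeback earlier and block writers sooner, so a brick
# never has tens of seconds' worth of dirty data queued for one EBS volume.
set_sysctl("vm.dirty_background_bytes", 64 * 1024 * 1024)   # placeholder value
set_sysctl("vm.dirty_bytes", 256 * 1024 * 1024)             # placeholder value

# Give slow-but-alive bricks more headroom than the 42-second default.
subprocess.run(["gluster", "volume", "set", "myvol",
                "network.ping-timeout", "60"], check=True)  # "myvol" is hypothetical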

We have a small team that has been working on this problem for a couple
of weeks. I just joined the team on Friday. I am new to gluster, but I
am not at all new to low-level system programming, Linux administration,
etc. I'm very much open to the possibility of digging into the gluster
code and supplying patches if we can find a way to adjust the behavior
of gluster to make it behave better under these conditions.

So, here are my questions:

* Does anyone have experience with this type of issue who can offer any
suggestions on kernel parameters or gluster configurations we could play
with? We have several kernel parameters in mind and are starting to
measure their effect.
* Does anyone have any background on how we might be able to tell that
the system is getting itself into this state? Again, we have some ideas on
this already, mostly using sysstat to monitor disk utilization and queue
depth, though ultimately, if we find a reliable signal, we would probably
read the relevant counters in /proc directly from our own code (a rough
sketch follows after this list). I don't have the details with me right now.
* Can someone provide any pointers to where in the gluster code the ping
logic is handled and/or how one might go about making it a little smarter?
* Does my description of what we're dealing with suggest that we're just
missing something obvious? I jokingly asked the team whether they had
remembered to run glusterd with the --make-it-fast flag, but sometimes
there are solutions almost like that that we just overlook.
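
(On the detection question, one concrete signal we are experimenting with is
/proc/diskstats: a brick that is slow-but-alive keeps completing I/O and keeps
accumulating busy time even while requests stay in flight, whereas a genuinely
hung device shows queued requests with no progress. A rough Python sketch,
where the device name "xvdf" and the thresholds are assumptions for
illustration only:)

#!/usr/bin/env python3
# Sketch: sample /proc/diskstats twice to distinguish "saturated but draining"
# from "hung". The device name and thresholds are illustrative placeholders.
import time

DEVICE = "xvdf"  # hypothetical EBS device name

def disk_sample(dev):
    with open("/proc/diskstats") as f:
        for line in f:
            parts = line.split()
            if parts[2] == dev:
                return {
                    "inflight": int(parts[11]),  # I/Os currently in progress
                    "busy_ms": int(parts[12]),   # total time spent doing I/O
                }
    raise LookupError(dev)

before = disk_sample(DEVICE)
time.sleep(5)
after = disk_sample(DEVICE)

busy = after["busy_ms"] - before["busy_ms"]  # ms the device was busy in the window
if after["inflight"] > 0 and busy > 4500:    # ~90% utilized across the 5 s window
    print("saturated but still completing I/O: slow, not down")
elif after["inflight"] > 0 and busy == 0:
    print("requests queued with no progress: possibly really hung")
else:
    print("keeping up")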

For what it's worth, we're running gluster 3.8 on CentOS 7 in EC2. We
see the problem most strongly when using general purpose (gp2) EBS
volumes on higher-performance but non-EBS-optimized instances, where it's
pretty easy to overload the disk with traffic over the network. We can
mostly mitigate this by using provisioned I/O volumes or EBS optimized
volumes on slower instances where the disk outperforms what we can throw
at it over the network. Yet at our scale, switching to EBS optimization
would cost hundreds of thousands of dollars a year, and running slower
instances has obvious drawbacks. In the absence of a "real" solution, we
will probably end up modifying our software to throttle its own writes to
disk (sketched below), but having to change our application just to keep
from flooding the file system seems like a really sad thing to have to do.
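
(If we do go down that road, it would probably look like a token-bucket
limiter wrapped around our write path; a minimal Python sketch, with
placeholder rate and burst numbers:)

#!/usr/bin/env python3
# Sketch: token-bucket throttle so our writers never hand the brick more dirty
# data than its EBS volume can drain. The rate/burst values are placeholders.
import time

class WriteThrottle:
    def __init__(self, bytes_per_sec, burst_bytes):
        self.rate = float(bytes_per_sec)
        self.capacity = float(burst_bytes)
        self.tokens = float(burst_bytes)
        self.last = time.monotonic()

    def acquire(self, nbytes):
        # Refill tokens for elapsed time, then wait until nbytes can be spent.
        while True:
            now = time.monotonic()
            self.tokens = min(self.capacity,
                              self.tokens + (now - self.last) * self.rate)
            self.last = now
            if self.tokens >= nbytes:
                self.tokens -= nbytes
                return
            time.sleep((nbytes - self.tokens) / self.rate)

throttle = WriteThrottle(bytes_per_sec=80 * 1024 * 1024,   # placeholder
                         burst_bytes=256 * 1024 * 1024)    # placeholder

def throttled_write(fileobj, data):
    throttle.acquire(len(data))
    fileobj.write(data)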

Thanks in advance for any pointers!

--Jay
___
Gluster-users mailing list
Gluster-users@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-users


Re: [Gluster-users] Gluster 3.7.13 NFS Crash

2016-07-30 Thread Soumya Koduri

The inode stored in the shard xlator's local is NULL. CC'ing Kruthika to comment.

Thanks,
Soumya



(gdb) bt
#0  0x7f196acab210 in pthread_spin_lock () from /lib64/libpthread.so.0
#1  0x7f196be7bcd5 in fd_anonymous (inode=0x0) at fd.c:804
#2  0x7f195deb1787 in shard_common_inode_write_do (frame=0x7f19699f1164, this=0x7f195802ac10) at shard.c:3716
#3  0x7f195deb1a53 in shard_common_inode_write_post_lookup_shards_handler (frame=<optimized out>, this=<optimized out>) at shard.c:3769
#4  0x7f195deaaff5 in shard_common_lookup_shards_cbk (frame=0x7f19699f1164, cookie=<optimized out>, this=0x7f195802ac10, op_ret=0, op_errno=<optimized out>, inode=<optimized out>, buf=0x7f194970bc40, xdata=0x7f196c15451c, postparent=0x7f194970bcb0) at shard.c:1601
#5  0x7f195e10a141 in dht_lookup_cbk (frame=0x7f196998e7d4, cookie=<optimized out>, this=<optimized out>, op_ret=0, op_errno=0, inode=0x7f195c532b18, stbuf=0x7f194970bc40, xattr=0x7f196c15451c, postparent=0x7f194970bcb0) at dht-common.c:2174
#6  0x7f195e3931f3 in afr_lookup_done (frame=frame@entry=0x7f196997f8a4, this=this@entry=0x7f1958022a20) at afr-common.c:1825
#7  0x7f195e393b84 in afr_lookup_metadata_heal_check (frame=frame@entry=0x7f196997f8a4, this=0x7f1958022a20, this@entry=0xe3a929e0b67fa500) at afr-common.c:2068
#8  0x7f195e39434f in afr_lookup_entry_heal (frame=frame@entry=0x7f196997f8a4, this=0xe3a929e0b67fa500, this@entry=0x7f1958022a20) at afr-common.c:2157
#9  0x7f195e39467d in afr_lookup_cbk (frame=0x7f196997f8a4, cookie=<optimized out>, this=0x7f1958022a20, op_ret=<optimized out>, op_errno=<optimized out>, inode=<optimized out>, buf=0x7f195effa940, xdata=0x7f196c1853b0, postparent=0x7f195effa9b0) at afr-common.c:2205
#10 0x7f195e5e2e42 in client3_3_lookup_cbk (req=<optimized out>, iov=<optimized out>, count=<optimized out>, myframe=0x7f19652c) at client-rpc-fops.c:2981
#11 0x7f196bc0ca30 in rpc_clnt_handle_reply (clnt=clnt@entry=0x7f19583adaf0, pollin=pollin@entry=0x7f195907f930) at rpc-clnt.c:764
#12 0x7f196bc0ccef in rpc_clnt_notify (trans=<optimized out>, mydata=0x7f19583adb20, event=<optimized out>, data=0x7f195907f930) at rpc-clnt.c:925
#13 0x7f196bc087c3 in rpc_transport_notify (this=this@entry=0x7f19583bd770, event=event@entry=RPC_TRANSPORT_MSG_RECEIVED, data=data@entry=0x7f195907f930) at rpc-transport.c:546
#14 0x7f1960acf9a4 in socket_event_poll_in (this=this@entry=0x7f19583bd770) at socket.c:2353
#15 0x7f1960ad25e4 in socket_event_handler (fd=fd@entry=25, idx=idx@entry=14, data=0x7f19583bd770, poll_in=1, poll_out=0, poll_err=0) at socket.c:2466
#16 0x7f196beacf7a in event_dispatch_epoll_handler (event=0x7f195effae80, event_pool=0x7f196dbf5f20) at event-epoll.c:575
#17 event_dispatch_epoll_worker (data=0x7f196dc41e10) at event-epoll.c:678
#18 0x7f196aca6dc5 in start_thread () from /lib64/libpthread.so.0
#19 0x7f196a5ebced in clone () from /lib64/libc.so.6




NFS logs and the core dump can be found in the Dropbox link below:
https://db.tt/rZrC9d7f


thanks in advance.

Respectfully,
Mahdi A. Mahdi



___
Gluster-users mailing list
Gluster-users@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-users




[Gluster-users] Gluster 3.7.13 NFS Crash

2016-07-30 Thread Mahdi Adnan
Hi,
I would really appreciate it if someone could help me fix my NFS crash; it's
happening a lot and causing lots of issues for my VMs. The problem is that
every few hours the native NFS server crashes and the volume becomes
unavailable from the affected node unless I restart glusterd. The volume is
used by VMware ESXi as a datastore for its VMs, with the following options:

OS: CentOS 7.2
Gluster: 3.7.13

Volume Name: vlm01
Type: Distributed-Replicate
Volume ID: eacd8248-dca3-4530-9aed-7714a5a114f2
Status: Started
Number of Bricks: 7 x 3 = 21
Transport-type: tcp
Bricks:
Brick1: gfs01:/bricks/b01/vlm01
Brick2: gfs02:/bricks/b01/vlm01
Brick3: gfs03:/bricks/b01/vlm01
Brick4: gfs01:/bricks/b02/vlm01
Brick5: gfs02:/bricks/b02/vlm01
Brick6: gfs03:/bricks/b02/vlm01
Brick7: gfs01:/bricks/b03/vlm01
Brick8: gfs02:/bricks/b03/vlm01
Brick9: gfs03:/bricks/b03/vlm01
Brick10: gfs01:/bricks/b04/vlm01
Brick11: gfs02:/bricks/b04/vlm01
Brick12: gfs03:/bricks/b04/vlm01
Brick13: gfs01:/bricks/b05/vlm01
Brick14: gfs02:/bricks/b05/vlm01
Brick15: gfs03:/bricks/b05/vlm01
Brick16: gfs01:/bricks/b06/vlm01
Brick17: gfs02:/bricks/b06/vlm01
Brick18: gfs03:/bricks/b06/vlm01
Brick19: gfs01:/bricks/b07/vlm01
Brick20: gfs02:/bricks/b07/vlm01
Brick21: gfs03:/bricks/b07/vlm01
Options Reconfigured:
performance.readdir-ahead: off
performance.quick-read: off
performance.read-ahead: off
performance.io-cache: off
performance.stat-prefetch: off
cluster.eager-lock: enable
network.remote-dio: enable
cluster.quorum-type: auto
cluster.server-quorum-type: server
performance.strict-write-ordering: on
performance.write-behind: off
cluster.data-self-heal-algorithm: full
cluster.self-heal-window-size: 128
features.shard-block-size: 16MB
features.shard: on
auth.allow: 192.168.221.50,192.168.221.51,192.168.221.52,192.168.221.56,192.168.208.130,192.168.208.131,192.168.208.132,192.168.208.89,192.168.208.85,192.168.208.208.86
network.ping-timeout: 10

Latest bt:

(gdb) bt
#0  0x7f196acab210 in pthread_spin_lock () from /lib64/libpthread.so.0
#1  0x7f196be7bcd5 in fd_anonymous (inode=0x0) at fd.c:804
#2  0x7f195deb1787 in shard_common_inode_write_do (frame=0x7f19699f1164, this=0x7f195802ac10) at shard.c:3716
#3  0x7f195deb1a53 in shard_common_inode_write_post_lookup_shards_handler (frame=<optimized out>, this=<optimized out>) at shard.c:3769
#4  0x7f195deaaff5 in shard_common_lookup_shards_cbk (frame=0x7f19699f1164, cookie=<optimized out>, this=0x7f195802ac10, op_ret=0, op_errno=<optimized out>, inode=<optimized out>, buf=0x7f194970bc40, xdata=0x7f196c15451c, postparent=0x7f194970bcb0) at shard.c:1601
#5  0x7f195e10a141 in dht_lookup_cbk (frame=0x7f196998e7d4, cookie=<optimized out>, this=<optimized out>, op_ret=0, op_errno=0, inode=0x7f195c532b18, stbuf=0x7f194970bc40, xattr=0x7f196c15451c, postparent=0x7f194970bcb0) at dht-common.c:2174
#6  0x7f195e3931f3 in afr_lookup_done (frame=frame@entry=0x7f196997f8a4, this=this@entry=0x7f1958022a20) at afr-common.c:1825
#7  0x7f195e393b84 in afr_lookup_metadata_heal_check (frame=frame@entry=0x7f196997f8a4, this=0x7f1958022a20, this@entry=0xe3a929e0b67fa500) at afr-common.c:2068
#8  0x7f195e39434f in afr_lookup_entry_heal (frame=frame@entry=0x7f196997f8a4, this=0xe3a929e0b67fa500, this@entry=0x7f1958022a20) at afr-common.c:2157
#9  0x7f195e39467d in afr_lookup_cbk (frame=0x7f196997f8a4, cookie=<optimized out>, this=0x7f1958022a20, op_ret=<optimized out>, op_errno=<optimized out>, inode=<optimized out>, buf=0x7f195effa940, xdata=0x7f196c1853b0, postparent=0x7f195effa9b0) at afr-common.c:2205
#10 0x7f195e5e2e42 in client3_3_lookup_cbk (req=<optimized out>, iov=<optimized out>, count=<optimized out>, myframe=0x7f19652c) at client-rpc-fops.c:2981
#11 0x7f196bc0ca30 in rpc_clnt_handle_reply (clnt=clnt@entry=0x7f19583adaf0, pollin=pollin@entry=0x7f195907f930) at rpc-clnt.c:764
#12 0x7f196bc0ccef in rpc_clnt_notify (trans=<optimized out>, mydata=0x7f19583adb20, event=<optimized out>, data=0x7f195907f930) at rpc-clnt.c:925
#13 0x7f196bc087c3 in rpc_transport_notify (this=this@entry=0x7f19583bd770, event=event@entry=RPC_TRANSPORT_MSG_RECEIVED, data=data@entry=0x7f195907f930) at rpc-transport.c:546
#14 0x7f1960acf9a4 in socket_event_poll_in (this=this@entry=0x7f19583bd770) at socket.c:2353
#15 0x7f1960ad25e4 in socket_event_handler (fd=fd@entry=25, idx=idx@entry=14, data=0x7f19583bd770, poll_in=1, poll_out=0, poll_err=0) at socket.c:2466
#16 0x7f196beacf7a in event_dispatch_epoll_handler (event=0x7f195effae80, event_pool=0x7f196dbf5f20) at event-epoll.c:575
#17 event_dispatch_epoll_worker (data=0x7f196dc41e10) at event-epoll.c:678
#18 0x7f196aca6dc5 in start_thread () from /lib64/libpthread.so.0
#19 0x7f196a5ebced in clone () from /lib64/libc.so.6



NFS logs and the core dump can be found in the Dropbox link below:
https://db.tt/rZrC9d7f


Thanks in advance.

Respectfully,
Mahdi A. Mahdi

  ___
Gluster-users mailing list
Gluster-users@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-users