Re: [Gluster-users] NFS over gluster stops responding under write load

2012-09-20 Thread Lonni J Friedman
No replies?  Surely this can't be expected behavior?

On Tue, Sep 18, 2012 at 2:59 PM, Lonni J Friedman netll...@gmail.com wrote:
 [original post quoted in full; trimmed - see the 2012-09-18 message below]


Re: [Gluster-users] NFS over gluster stops responding under write load

2012-09-20 Thread harry mangalam
Not expected behavior, but it has been noted before:

Joe Julian posted this only a few days ago - it sounds similar to your
problem:

https://bugzilla.redhat.com/show_bug.cgi?id=849526
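
If it is the same leak, the RSS of the gluster NFS server process
should climb steadily under the bonnie++ load.  A quick way to watch
for that (a sketch; the process name is assumed from the stock
packages):

watch -n 60 'ps -o pid,rss,etime,cmd -C glusterfs'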


On Thursday, September 20, 2012 11:58:50 AM Lonni J Friedman wrote:
 No replies?  Surely this can't be expected behavior?
 
 On Tue, Sep 18, 2012 at 2:59 PM, Lonni J Friedman netll...@gmail.com 
wrote:
  [original post quoted in full; trimmed - see the 2012-09-18 message below]

Re: [Gluster-users] NFS over gluster stops responding under write load

2012-09-20 Thread Lonni J Friedman
Yup, that does look quite similar.

On Thu, Sep 20, 2012 at 12:25 PM, harry mangalam harry.manga...@uci.edu wrote:
 [harry's reply and the earlier thread quoted in full; trimmed - see above]

[Gluster-users] NFS over gluster stops responding under write load

2012-09-18 Thread Lonni J Friedman
Greetings,
I'm running version 3.3.0 on Fedora16-x86_64.  I have two bricks set up
as a replicated volume on XFS-formatted filesystems.  I've NFS-mounted
the volume on a third system and invoked bonnie++ to write to the NFS
mount point.  After a few minutes I noticed that bonnie++ was no longer
producing progress output, at which point I started checking assorted
logs to see if anything was wrong.  The client system was no longer able
to write to the NFS mount point, and dmesg (and /var/log/messages) was
spewing these warnings like crazy (dozens per second):
nfs: server 10.31.99.166 not responding, still trying

Those warnings started at 14:40:58 on the client system, but oddly
stopped a few seconds later, at 14:41:04.  Here's the full bonnie++
output (/mnt/gv0 is where the gluster volume is NFS-mounted on the
client):

[root@cuda-ljf0 ~]# bonnie++ -d /mnt/gv0 -u root
Using uid:0, gid:0.
Writing a byte at a time...done
Writing intelligently...done
Rewriting...
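
For anyone trying to reproduce this, the setup boils down to roughly
the following sketch.  The volume-create line is reconstructed from the
volume info further down; the mount options are assumptions (Gluster's
built-in NFS server speaks only NFSv3 over TCP, so vers=3,tcp or the
equivalent is needed):

# on one of the servers (reconstructed from the volume info below)
gluster volume create gv0 replica 2 10.31.99.165:/mnt/sdb1 10.31.99.166:/mnt/sdb1
gluster volume start gv0
# on the client (options assumed; gluster NFS is v3/TCP only)
mount -t nfs -o vers=3,tcp 10.31.99.166:/gv0 /mnt/gv0
bonnie++ -d /mnt/gv0 -u root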


Here's what's in the glusterfs logs at the moment:
##
# tail etc-glusterfs-glusterd.vol.log
[2012-09-18 14:54:39.026557] I
[glusterd-handler.c:542:glusterd_req_ctx_create] 0-glusterd: Received
op from uuid: 1d3fb6c7-f5eb-42e9-b2bc-48bd3ed09e62
[2012-09-18 14:54:39.029463] I
[glusterd-handler.c:1417:glusterd_op_stage_send_resp] 0-glusterd:
Responded to stage, ret: 0
[2012-09-18 14:54:46.993426] I
[glusterd-volume-ops.c:492:glusterd_handle_cli_heal_volume]
0-management: Received heal vol req for volume gv0
[2012-09-18 14:54:46.993503] E [glusterd-utils.c:277:glusterd_lock]
0-glusterd: Unable to get lock for uuid:
e9ce949d-8521-4868-ad1b-860e0ffd8768, lock held by:
1d3fb6c7-f5eb-42e9-b2bc-48bd3ed09e62
[2012-09-18 14:54:46.993520] E
[glusterd-handler.c:453:glusterd_op_txn_begin] 0-management: Unable to
acquire local lock, ret: -1
[2012-09-18 14:55:47.175521] I
[glusterd-handler.c:860:glusterd_handle_cli_get_volume] 0-glusterd:
Received get vol req
[2012-09-18 14:55:47.181048] I
[glusterd-handler.c:860:glusterd_handle_cli_get_volume] 0-glusterd:
Received get vol req
[2012-09-18 14:55:49.306776] I
[glusterd-volume-ops.c:492:glusterd_handle_cli_heal_volume]
0-management: Received heal vol req for volume gv0
[2012-09-18 14:55:49.306834] E [glusterd-utils.c:277:glusterd_lock]
0-glusterd: Unable to get lock for uuid:
e9ce949d-8521-4868-ad1b-860e0ffd8768, lock held by:
1d3fb6c7-f5eb-42e9-b2bc-48bd3ed09e62
[2012-09-18 14:55:49.306844] E
[glusterd-handler.c:453:glusterd_op_txn_begin] 0-management: Unable to
acquire local lock, ret: -1
# tail -f cli.log
[2012-09-18 14:55:47.176824] I
[cli-rpc-ops.c:479:gf_cli3_1_get_volume_cbk] 0-cli: Received resp to
get vol: 0
[2012-09-18 14:55:47.180959] I
[cli-rpc-ops.c:732:gf_cli3_1_get_volume_cbk] 0-cli: Returning: 0
[2012-09-18 14:55:47.181128] I
[cli-rpc-ops.c:479:gf_cli3_1_get_volume_cbk] 0-cli: Received resp to
get vol: 0
[2012-09-18 14:55:47.181167] I
[cli-rpc-ops.c:732:gf_cli3_1_get_volume_cbk] 0-cli: Returning: 0
[2012-09-18 14:55:47.181214] I [input.c:46:cli_batch] 0-: Exiting with: 0
[2012-09-18 14:55:49.244795] W
[rpc-transport.c:174:rpc_transport_load] 0-rpc-transport: missing
'option transport-type'. defaulting to socket
[2012-09-18 14:55:49.307054] I
[cli-rpc-ops.c:5905:gf_cli3_1_heal_volume_cbk] 0-cli: Received resp to
heal volume
[2012-09-18 14:55:49.307274] W [dict.c:2339:dict_unserialize]
(--/usr/lib64/libgfrpc.so.0(rpc_clnt_notify+0xa5) [0x328ca10365]
(--/usr/lib64/libgfrpc.so.0(rpc_clnt_handle_reply+0xa5)
[0x328ca0f965] (--gluster(gf_cli3_1_heal_volume_cbk+0x1d4)
[0x4225e4]))) 0-dict: buf is null!
[2012-09-18 14:55:49.307289] E
[cli-rpc-ops.c:5930:gf_cli3_1_heal_volume_cbk] 0-: Unable to allocate
memory
[2012-09-18 14:55:49.307314] I [input.c:46:cli_batch] 0-: Exiting with: -1
##
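
The lock errors in that glusterd log explain why "gluster volume heal
gv0 info" fails below: this node cannot take the cluster-wide
transaction lock because the peer with uuid
1d3fb6c7-f5eb-42e9-b2bc-48bd3ed09e62 is still holding it.  A rough
sketch of the usual (not guaranteed) way to clear a stale lock like
this, run on the peer holding it:

# map the uuid from the log to a host
gluster peer status
# restarting glusterd releases the stale transaction lock
systemctl restart glusterd.service   # or 'service glusterd restart'
# then retry
gluster volume heal gv0 info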

I'd be happy to provide more if someone requests something specific.

Not sure what other information to provide at this point, but here are
the basics of the gluster setup:
##
# gluster volume info all

Volume Name: gv0
Type: Replicate
Volume ID: 200046fc-1b5f-460c-b54b-96932e31ed3c
Status: Started
Number of Bricks: 1 x 2 = 2
Transport-type: tcp
Bricks:
Brick1: 10.31.99.165:/mnt/sdb1
Brick2: 10.31.99.166:/mnt/sdb1
# gluster volume heal gv0 info
operation failed
##

I just noticed that glusterfs seems to be rapidly heading toward OOM
territory: the glusterfs daemon is currently consuming 90% of memory,
according to top.
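
If that keeps climbing, a statedump of the bloated process is probably
the most useful thing to attach to a bug report.  A minimal sketch (the
SIGUSR1 statedump mechanism is standard glusterfs; the dump directory
varies by build, typically /tmp or /var/run/gluster):

# find the gluster NFS server process and its memory footprint
ps -o pid,rss,vsz,cmd -C glusterfs
# SIGUSR1 asks glusterfs to write a statedump
kill -USR1 <pid-from-above>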

thanks
___
Gluster-users mailing list
Gluster-users@gluster.org
http://gluster.org/cgi-bin/mailman/listinfo/gluster-users


Re: [Gluster-users] NFS over gluster stops responding under write load

2012-09-18 Thread Lonni J Friedman
On Tue, Sep 18, 2012 at 2:59 PM, Lonni J Friedman netll...@gmail.com wrote:
 [original post quoted in full; trimmed - see above]

I just attempted to shut down the glusterd service, and it ran off a
cliff: the OOM killer kicked in and killed it.  From dmesg:
#
[ 4151.733182] glusterfsd invoked oom-killer: gfp_mask=0x201da,
order=0, oom_adj=0, oom_score_adj=0
[ 4151.733186] glusterfsd cpuset=/ mems_allowed=0
[ 4151.733189] Pid: 2567, comm: glusterfsd
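
As a stopgap while collecting diagnostics, the leaking process can be
shielded from the OOM killer.  This is only a sketch, and it trades a
killed daemon for a possibly unresponsive box, so use it with care:

# kernels >= 2.6.36 only; -1000 exempts this pid from the OOM killer
echo -1000 > /proc/$(pidof -s glusterfs)/oom_score_adj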