Re: [Gluster-users] fractured/split glusterfs - 2 up, 2 down for an hour - SOLVED
All these problems disappeared with a client unmount/remount of the gluster filesystem. A remarkably simple fix for such a bizarre set of symptoms. We'll see how durable the fix is, but all cluster nodes that have had it applied are now behaving normally (AFAICT), and the 2 that have not (due to long-running jobs writing to new files on existing dirs) still have the otherwise odd behavior, previously described in excruciating detail. Maybe this should be added to the HOWTO/DOTHISINCASEOFEMERGENCY doc.

hjm

---
Harry Mangalam - Research Computing, OIT, Rm 225 MSTB, UC Irvine [m/c 2225] / 92697
Google Voice Multiplexer: (949) 478-4487
415 South Circle View Dr, Irvine, CA, 92697 [shipping]
MSTB Lat/Long: (33.642025,-117.844414) (paste into Google Maps)
---
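For anyone hitting the same symptoms, the fix above boils down to something like the following - a minimal sketch, assuming the volume 'gl' from server bs1 is FUSE-mounted at /gl (both names inferred from the logs in this thread):

# Unmount the gluster FUSE client; fall back to a lazy unmount if files are open:
umount /gl || umount -l /gl
# Remount (or just 'mount /gl' if the volume is in /etc/fstab):
mount -t glusterfs bs1:/gl /gl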
Re: [Gluster-users] fractured/split glusterfs - 2 up, 2 down for an hour
On Saturday, January 04, 2014 10:45:29 PM Vijay Bellur wrote:
> rdma.so seems to be missing here. Is the glusterfs-rdma-3.4.2-1 rpm installed on the servers?

It's not. The original gluster (3.2, I think) was set up with RDMA and IP transport, but RDMA was never instantiated and it's been working fine without it (except for the zillions of repeating errors).

hjm

---
Harry Mangalam - Research Computing, OIT, Rm 225 MSTB, UC Irvine [m/c 2225] / 92697
Google Voice Multiplexer: (949) 478-4487
415 South Circle View Dr, Irvine, CA, 92697 [shipping]
MSTB Lat/Long: (33.642025,-117.844414) (paste into Google Maps)
---
Re: [Gluster-users] fractured/split glusterfs - 2 up, 2 down for an hour
Also some other anomalies. Even when the files are visible and readable, many dirs are unwritable and/or undeletable. For example:

Sat Jan 04 18:36:17 [0.02 0.08 0.12] root@hpc-s:/bio/mmacchie
1104 $ mkdir hjmtest
mkdir: cannot create directory `hjmtest': Invalid argument
Sat Jan 04 18:36:23 [0.02 0.08 0.12] root@hpc-s:/bio/mmacchie

The client log says this for that operation (note offset times - UTC vs local): http://pastie.org/8602365

And in many subdirs, other dirs can be made, but not deleted:

Sat Jan 04 18:41:45 [0.00 0.04 0.09] root@hpc-s:/bio/mmacchie/Nematodes2/phast/steiner_motifs/mmacchie_recovered
1109 $ mkdir j1
Sat Jan 04 18:42:00 [0.00 0.03 0.09] root@hpc-s:/bio/mmacchie/Nematodes2/phast/steiner_motifs/mmacchie_recovered
1110 $ rmdir j1
rmdir: failed to remove `j1': Transport endpoint is not connected
Sat Jan 04 18:42:09 [0.08 0.05 0.09] root@hpc-s:/bio/mmacchie/Nematodes2/phast/steiner_motifs/mmacchie_recovered

With the client log saying:

[2014-01-05 02:42:09.548263] W [client-rpc-fops.c:526:client3_3_stat_cbk] 0-gl-client-2: remote operation failed: Transport endpoint is not connected
[2014-01-05 02:42:09.549314] W [client-rpc-fops.c:2624:client3_3_lookup_cbk] 0-gl-client-2: remote operation failed: Transport endpoint is not connected. Path: /bio/mmacchie/Nematodes2/phast/steiner_motifs/mmacchie_recovered/j1 (aebbf21f-37fe-4edc-be8a-0f57b057b516)
[2014-01-05 02:42:09.550124] W [client-rpc-fops.c:2541:client3_3_opendir_cbk] 0-gl-client-2: remote operation failed: Transport endpoint is not connected. Path: /bio/mmacchie/Nematodes2/phast/steiner_motifs/mmacchie_recovered/j1 (aebbf21f-37fe-4edc-be8a-0f57b057b516)
[2014-01-05 02:42:09.552439] W [fuse-bridge.c:1193:fuse_unlink_cbk] 0-glusterfs-fuse: 5805445: RMDIR() /bio/mmacchie/Nematodes2/phast/steiner_motifs/mmacchie_recovered/j1 = -1 (Transport endpoint is not connected)
[2014-01-05 02:42:12.175860] W [socket.c:514:__socket_rwv] 0-gl-client-2: readv failed (No data available)
[2014-01-05 02:42:15.181365] W [socket.c:514:__socket_rwv] 0-gl-client-2: readv failed (No data available)
[2014-01-05 02:42:18.186668] W [socket.c:514:__socket_rwv] 0-gl-client-2: readv failed (No data available)

This is odd - how can a dir be created OK but then the fs lose track of it when asked to delete it? And that dir (j1) can have /files/ created and deleted inside of it, but not other /dirs/ (same result as for the parent dir).

In looking thru the client log, I see instances of this:

[2014-01-05 02:27:20.721043] W [client-rpc-fops.c:2624:client3_3_lookup_cbk] 0-gl-client-2: remote operation failed: Transport endpoint is not connected. Path: /bio/mmacchie/Nematodes (----)
[2014-01-05 02:27:20.769058] I [dht-layout.c:630:dht_layout_normalize] 0-gl-dht: found anomalies in /bio/mmacchie/Nematodes.
holes=2 overlaps=0
[2014-01-05 02:27:20.769090] W [dht-selfheal.c:900:dht_selfheal_directory] 0-gl-dht: 1 subvolumes down -- not fixing
[2014-01-05 02:27:20.784335] W [client-rpc-fops.c:2624:client3_3_lookup_cbk] 0-gl-client-2:

more at: http://pastie.org/8602381

alarming since it says:

[2014-01-05 02:27:20.769090] W [dht-selfheal.c:900:dht_selfheal_directory] 0-gl-dht: 1 subvolumes down -- not fixing

All my servers and bricks appear to be up and online:

Sat Jan 04 18:54:09 [0.76 0.30 0.20] root@biostor1:~
1003 $ gluster volume status gl detail | egrep 'Brick|Online'
Brick: Brick bs2:/raid1
Online : Y
Brick: Brick bs2:/raid2
Online : Y
Brick: Brick bs3:/raid1
Online : Y
Brick: Brick bs3:/raid2
Online : Y
Brick: Brick bs4:/raid1
Online : Y
Brick: Brick bs4:/raid2
Online : Y
Brick: Brick bs1:/raid1
Online : Y
Brick: Brick bs1:/raid2
Online : Y

The gluster server logs seem to be fairly quiet thru this. The following contains the logs for the last day or so from the 4 servers, reduced by this command to eliminate the 'socket.c:2788' errors:

grep -v socket.c:2788 /var/log/glusterfs/etc-glusterfs-glusterd.vol.log

http://pastie.org/8602412

hjm

On Saturday, January 04, 2014 10:45:29 PM Vijay Bellur wrote:
> On 01/04/2014 07:21 AM, harry mangalam wrote:
> > This is a distributed-only glusterfs on 4 servers with 2 bricks each on an IPoIB network. Thanks to a misconfigured autoupdate script, when 3.4.2 was released today, my gluster servers tried to update themselves. 2 succeeded but then failed to restart; the other 2 failed to update and kept running. Not realizing the sequence of events, I restarted the 2 that failed to restart, which gave
[Gluster-users] fractured/split glusterfs - 2 up, 2 down for an hour
This is a distributed-only glusterfs on 4 servers with 2 bricks each on an IPoIB network. Thanks to a misconfigured autoupdate script, when 3.4.2 was released today, my gluster servers tried to update themselves. 2 succeeded but then failed to restart; the other 2 failed to update and kept running. Not realizing the sequence of events, I restarted the 2 that failed to restart, which gave my fs 2 servers running 3.4.1 and 2 running 3.4.2. When I realized this after about 30m, I shut everything down, updated the 2 remaining to 3.4.2, and then restarted, but now I'm getting lots of reports of file errors of the type 'endpoints not connected' and the like:

[2014-01-04 01:31:18.593547] W [client-rpc-fops.c:2624:client3_3_lookup_cbk] 0-gl-client-2: remote operation failed: Transport endpoint is not connected. Path: /bio/fishm/test_cuffdiff.sh (----)
[2014-01-04 01:31:18.594928] W [client-rpc-fops.c:2624:client3_3_lookup_cbk] 0-gl-client-2: remote operation failed: Transport endpoint is not connected. Path: /bio/fishm/test_cuffdiff.sh (----)
[2014-01-04 01:31:18.595818] W [client-rpc-fops.c:2624:client3_3_lookup_cbk] 0-gl-client-2: remote operation failed: Transport endpoint is not connected. Path: /bio/fishm/.#test_cuffdiff.sh (14c3b612-e952-4aec-ae18-7f3dbb422dcc)
[2014-01-04 01:31:18.597381] W [client-rpc-fops.c:2624:client3_3_lookup_cbk] 0-gl-client-2: remote operation failed: Transport endpoint is not connected. Path: /bio/fishm/test_cuffdiff.sh (----)
[2014-01-04 01:31:18.598212] W [client-rpc-fops.c:814:client3_3_statfs_cbk] 0-gl-client-2: remote operation failed: Transport endpoint is not connected
[2014-01-04 01:31:18.598236] W [dht-diskusage.c:45:dht_du_info_cbk] 0-gl-dht: failed to get disk info from gl-client-2
[2014-01-04 01:31:19.912210] W [socket.c:514:__socket_rwv] 0-gl-client-2: readv failed (No data available)
[2014-01-04 01:31:22.912717] W [socket.c:514:__socket_rwv] 0-gl-client-2: readv failed (No data available)
[2014-01-04 01:31:25.913208] W [socket.c:514:__socket_rwv] 0-gl-client-2: readv failed (No data available)

The servers at the same time provided the following error 'E' messages:

Fri Jan 03 17:46:42 [0.20 0.12 0.13] root@biostor1:~
1008 $ grep ' E ' /var/log/glusterfs/bricks/raid1.log | grep '2014-01-03'
[2014-01-03 06:11:36.251786] E [server-helpers.c:751:server_alloc_frame] (-->/usr/lib64/libgfrpc.so.0(rpcsvc_notify+0x103) [0x3161e090d3] (-->/usr/lib64/libgfrpc.so.0(rpcsvc_handle_rpc_call+0x245) [0x3161e08f85] (-->/usr/lib64/glusterfs/3.4.1/xlator/protocol/server.so(server3_3_lookup+0xa0) [0x7fa60e577170]))) 0-server: invalid argument: conn
[2014-01-03 06:11:36.251813] E [rpcsvc.c:450:rpcsvc_check_and_reply_error] 0-rpcsvc: rpc actor failed to complete successfully
[2014-01-03 17:48:44.236127] E [rpc-transport.c:253:rpc_transport_load] 0-rpc-transport: /usr/lib64/glusterfs/3.4.1/rpc-transport/rdma.so: cannot open shared object file: No such file or directory
[2014-01-03 19:15:26.643378] E [rpc-transport.c:253:rpc_transport_load] 0-rpc-transport: /usr/lib64/glusterfs/3.4.2/rpc-transport/rdma.so: cannot open shared object file: No such file or directory

The missing/misbehaving files /are/ accessible on the individual bricks, but not thru gluster. This is a distributed-only setup, not replicated, so it seems like 'gluster volume heal <volume>' is appropriate. Do the gluster wizards agree?
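For anyone cleaning up after a similar half-upgrade, a quick sanity check that all servers ended up on the same build - a sketch, assuming root ssh to the 4 server hostnames used elsewhere in these threads:

for h in bs1 bs2 bs3 bs4; do
    # each server should report the same glusterfs-server (and glusterfs-rdma) build
    echo -n "$h: "; ssh $h 'rpm -q glusterfs-server glusterfs-rdma'
done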
---
Harry Mangalam - Research Computing, OIT, Rm 225 MSTB, UC Irvine [m/c 2225] / 92697
Google Voice Multiplexer: (949) 478-4487
415 South Circle View Dr, Irvine, CA, 92697 [shipping]
MSTB Lat/Long: (33.642025,-117.844414) (paste into Google Maps)
---
Re: [Gluster-users] gluster fails under heavy array job load
Bug 1043009 submitted.

On Thursday, December 12, 2013 11:46:03 PM Anand Avati wrote:
> Please provide the full client and server logs (in a bug report). The snippets give some hints, but are not very meaningful without the full context/history since mount time (they have after-the-fact symptoms, but not the part which shows the reason why the disconnects happened). Even before looking into the full logs, here are some quick observations:
>
> - write-behind-window-size = 1024MB seems *excessively* high. Please set this to 1MB (default) and check if the stability improves.
>
> - I see RDMA is enabled on the volume. Are you mounting clients through RDMA? If so, for the purpose of diagnostics, can you mount through TCP and check if the stability improves? If you are using RDMA with such a high write-behind-window-size, spurious ping-timeouts are an almost certainty during heavy writes. The RDMA driver has limited flow control, and setting such a high window-size can easily congest all the RDMA buffers, resulting in spurious ping-timeouts and disconnections.
>
> Avati
>
> On Thu, Dec 12, 2013 at 5:03 PM, harry mangalam <harry.manga...@uci.edu> wrote:
> > Hi All, (Gluster Volume Details at bottom)
> >
> > I've posted some of this previously, but even after various upgrades, attempted fixes, etc, it remains a problem.
> >
> > Short version: Our gluster fs (~340TB) provides scratch space for a ~5000-core academic compute cluster. Much of our load is streaming IO, doing a lot of genomics work, and that is the load under which we saw this latest failure. Under heavy batch load, especially array jobs, where there might be several 64-core nodes doing I/O on the 4 servers/8 bricks, we often get job failures that have the following profile:
> >
> > Client POV: here is a sampling of the client logs (/var/log/glusterfs/gl.log) for all compute nodes that indicated interaction with the user's files: http://pastie.org/8548781
> >
> > Here are some client Info logs that seem fairly serious: http://pastie.org/8548785
> >
> > The errors that referenced this user were gathered from all the nodes that were running his code (in compute*) and agglomerated with:
> >
> > cut -f2,3 -d']' compute* | cut -f1 -dP | sort | uniq -c | sort -gr
> >
> > and placed here to show the profile of errors that his run generated: http://pastie.org/8548796
> >
> > So 71 of them were:
> >
> > W [client-rpc-fops.c:2624:client3_3_lookup_cbk] 0-gl-client-7: remote operation failed: Transport endpoint is not connected.
> >
> > etc. We've seen this before and previously discounted it bc it seemed to have been related to the problem of spurious NFS-related bugs, but now I'm wondering whether it's a real problem. Also the 'remote operation failed: Stale file handle.' warnings. There were no Errors logged per se, tho some of the W's looked fairly nasty, like the 'dht_layout_dir_mismatch'.
> >
> > From the server side, however, during the same period, there were:
> > 0 Warnings about this user's files
> > 0 Errors
> > 458 Info lines, of which only 1 was not a 'cleanup' line like this:
> >
> > 10.2.7.11:[2013-12-12 21:22:01.064289] I [server-helpers.c:460:do_fd_cleanup] 0-gl-server: fd cleanup on /path/to/file
> >
> > It was:
> >
> > 10.2.7.14:[2013-12-12 21:00:35.209015] I [server-rpc-fops.c:898:_gf_server_log_setxattr_failure] 0-gl-server: 113697332: SETXATTR /bio/tdlong/RNAseqIII/ckpt.1084030 (c9488341-c063-4175-8492-75e2e282f690) ==> trusted.glusterfs.dht
> >
> > We're losing about 10% of these kinds of array jobs bc of this, which is just not supportable.
> > Gluster details: servers and clients running gluster 3.4.0-8.el6 over QDR IB, IPoIB, thru 2 Mellanox and 1 Voltaire switches, Mellanox cards, CentOS 6.4
> >
> > $ gluster volume info
> > Volume Name: gl
> > Type: Distribute
> > Volume ID: 21f480f7-fc5a-4fd8-a084-3964634a9332
> > Status: Started
> > Number of Bricks: 8
> > Transport-type: tcp,rdma
> > Bricks:
> > Brick1: bs2:/raid1
> > Brick2: bs2:/raid2
> > Brick3: bs3:/raid1
> > Brick4: bs3:/raid2
> > Brick5: bs4:/raid1
> > Brick6: bs4:/raid2
> > Brick7: bs1:/raid1
> > Brick8: bs1:/raid2
> > Options Reconfigured:
> > performance.write-behind-window-size: 1024MB
> > performance.flush-behind: on
> > performance.cache-size: 268435456
> > nfs.disable: on
> > performance.io-cache: on
> > performance.quick-read: on
> > performance.io-thread-count: 64
> > auth.allow: 10.2.*.*,10.1.*.*
> >
> > 'gluster volume status gl detail': http://pastie.org/8548826
> >
> > ---
> > Harry Mangalam - Research Computing, OIT, Rm 225 MSTB, UC Irvine [m/c 2225] / 92697
> > Google Voice Multiplexer: (949) 478-4487
> > 415 South Circle View Dr, Irvine, CA, 92697 [shipping]
> > MSTB Lat/Long: (33.642025,-117.844414) (paste into Google Maps)
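A sketch of Avati's first suggestion, using the volume name above (verify the exact option string against 'gluster volume set help' on your release):

# Return write-behind to its 1MB default, then watch whether the ping-timeouts stop:
gluster volume set gl performance.write-behind-window-size 1MB
# or drop the override entirely:
gluster volume reset gl performance.write-behind-window-size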
Re: [Gluster-users] gluster fails under heavy array job load
On Thursday, December 12, 2013 11:46:03 PM Anand Avati wrote:
> - I see RDMA is enabled on the volume. Are you mounting clients through RDMA? If so, for the purpose of diagnostics, can you mount through TCP and check if the stability improves? If you are using RDMA with such a high write-behind-window-size, spurious ping-timeouts are an almost certainty during heavy writes. The RDMA driver has limited flow control, and setting such a high window-size can easily congest all the RDMA buffers, resulting in spurious ping-timeouts and disconnections.

Is there a way to remove the RDMA transport option once it is enabled? I was under the impression that our system was NOT using RDMA, but from the logs I see the following, which implies that they /are/ using RDMA now:

== 10.2.7.11 ==
4: option transport-type socket,rdma
[2013-12-10 17:42:12.498076] I [glusterd-pmap.c:227:pmap_registry_bind] 0-pmap: adding brick /raid1.rdma on port 49153
[2013-12-10 17:42:15.571287] I [glusterd-pmap.c:227:pmap_registry_bind] 0-pmap: adding brick /raid2.rdma on port 49155

== 10.2.7.12 ==
4: option transport-type socket,rdma
[2013-12-10 17:42:17.974841] I [glusterd-pmap.c:227:pmap_registry_bind] 0-pmap: adding brick /raid1.rdma on port 49153
[2013-12-10 17:42:21.266486] I [glusterd-pmap.c:227:pmap_registry_bind] 0-pmap: adding brick /raid2.rdma on port 49155

== 10.2.7.13 ==
4: option transport-type socket,rdma
[2013-12-10 17:42:17.929753] I [glusterd-pmap.c:227:pmap_registry_bind] 0-pmap: adding brick /raid1.rdma on port 49153
[2013-12-10 17:42:21.646482] I [glusterd-pmap.c:227:pmap_registry_bind] 0-pmap: adding brick /raid2.rdma on port 49155

== 10.2.7.14 ==
4: option transport-type socket,rdma
[2013-12-10 17:42:15.791176] I [glusterd-pmap.c:227:pmap_registry_bind] 0-pmap: adding brick /raid1.rdma on port 49153
[2013-12-10 17:42:15.941182] I [glusterd-pmap.c:227:pmap_registry_bind] 0-pmap: adding brick /raid2.rdma on port 49155

---
Harry Mangalam - Research Computing, OIT, Rm 225 MSTB, UC Irvine [m/c 2225] / 92697
Google Voice Multiplexer: (949) 478-4487
415 South Circle View Dr, Irvine, CA, 92697 [shipping]
MSTB Lat/Long: (33.642025,-117.844414) (paste into Google Maps)
---
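Whether 3.4 can strip the rdma transport from an existing tcp,rdma volume is a separate question, but the transport each client negotiates is selectable at mount time - a sketch, reusing the volume/mount names from this thread:

# Explicitly mount over TCP (sockets), even though the volume advertises tcp,rdma:
mount -t glusterfs -o transport=tcp bs1:/gl /gl
# Conversely, suffixing the volume name with .rdma requests the rdma transport:
# mount -t glusterfs bs1:/gl.rdma /gl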
Re: [Gluster-users] gluster fails under heavy array job load
Hi Alex,

Thanks for taking the time to think about this. I don't have metrics at hand, but I tend to think not, for a few reasons:

- When I have looked at stats from the network, it has never been close to saturating; the bottlenecks appear to be mostly on the gluster server side. I get emailed if my servers go above a load of 8 (the servers have 8 cores), and when that happens I often get complaints from users that they've had incomplete runs. At these points the network load is often fairly high (1GB/s, aggregate), but on a QDR network that shouldn't be saturating.

- The same jobs, when run using another distributed FS on the same IB fabric, show no such behavior, which would tend to point the fault at gluster or (granted) my configuration of it.

- While a lot of the IO load is large streaming RW, there is a subsection of jobs whose users insist on using Zillions of Tiny (ZOT) files as output - they use the file names for indices or as table row entries. (One user had 20M files in a tree.) We're trying to educate them, but it takes time and energy. Gluster seems to have a lot of trouble traversing these huge file fields, more so than DFSs that use metadata servers.

That said, it has been stable otherwise and there are a lot of things to recommend it.

hjm

On Friday, December 13, 2013 02:00:19 PM Alex Chekholko wrote:
> Hi Harry,
> My best guess is that you overloaded your interconnect. Do you have metrics for if/when your network was saturated? That would cause Gluster clients to time out. My best guess is that you went into the E state of your USE (Utilization, Saturation, Error) spectrum. IME, that is a common pattern for our Lustre/GPFS clients: you get all kinds of weird error states if you manage to saturate your I/O for an extended period of time and fill all of the buffers everywhere.
> Regards, Alex
>
> On 12/12/2013 05:03 PM, harry mangalam wrote:
> > Short version: Our gluster fs (~340TB) provides scratch space for a ~5000-core academic compute cluster. Much of our load is streaming IO, doing a lot of genomics work, and that is the load under which we saw this latest failure.

---
Harry Mangalam - Research Computing, OIT, Rm 225 MSTB, UC Irvine [m/c 2225] / 92697
Google Voice Multiplexer: (949) 478-4487
415 South Circle View Dr, Irvine, CA, 92697 [shipping]
MSTB Lat/Long: (33.642025,-117.844414) (paste into Google Maps)
---
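A quick way to capture that network metric during a heavy run - a sketch with sysstat's sar, assuming the IPoIB interface is named ib0:

# Sample the interface every 5s during an array job; compare rxkB/s+txkB/s
# against the QDR IPoIB ceiling to see how close to saturation it actually gets:
sar -n DEV 5 | egrep 'IFACE|ib0'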
[Gluster-users] gluster fails under heavy array job load
Hi All, (Gluster Volume Details at bottom)

I've posted some of this previously, but even after various upgrades, attempted fixes, etc, it remains a problem.

Short version: Our gluster fs (~340TB) provides scratch space for a ~5000-core academic compute cluster. Much of our load is streaming IO, doing a lot of genomics work, and that is the load under which we saw this latest failure. Under heavy batch load, especially array jobs, where there might be several 64-core nodes doing I/O on the 4 servers/8 bricks, we often get job failures that have the following profile:

Client POV: here is a sampling of the client logs (/var/log/glusterfs/gl.log) for all compute nodes that indicated interaction with the user's files: http://pastie.org/8548781

Here are some client Info logs that seem fairly serious: http://pastie.org/8548785

The errors that referenced this user were gathered from all the nodes that were running his code (in compute*) and agglomerated with:

cut -f2,3 -d']' compute* | cut -f1 -dP | sort | uniq -c | sort -gr

and placed here to show the profile of errors that his run generated: http://pastie.org/8548796

So 71 of them were:

W [client-rpc-fops.c:2624:client3_3_lookup_cbk] 0-gl-client-7: remote operation failed: Transport endpoint is not connected.

etc. We've seen this before and previously discounted it bc it seemed to have been related to the problem of spurious NFS-related bugs, but now I'm wondering whether it's a real problem. Also the 'remote operation failed: Stale file handle.' warnings. There were no Errors logged per se, tho some of the W's looked fairly nasty, like the 'dht_layout_dir_mismatch'.

From the server side, however, during the same period, there were:
0 Warnings about this user's files
0 Errors
458 Info lines, of which only 1 was not a 'cleanup' line like this:

10.2.7.11:[2013-12-12 21:22:01.064289] I [server-helpers.c:460:do_fd_cleanup] 0-gl-server: fd cleanup on /path/to/file

It was:

10.2.7.14:[2013-12-12 21:00:35.209015] I [server-rpc-fops.c:898:_gf_server_log_setxattr_failure] 0-gl-server: 113697332: SETXATTR /bio/tdlong/RNAseqIII/ckpt.1084030 (c9488341-c063-4175-8492-75e2e282f690) ==> trusted.glusterfs.dht

We're losing about 10% of these kinds of array jobs bc of this, which is just not supportable.

Gluster details: servers and clients running gluster 3.4.0-8.el6 over QDR IB, IPoIB, thru 2 Mellanox and 1 Voltaire switches, Mellanox cards, CentOS 6.4

$ gluster volume info
Volume Name: gl
Type: Distribute
Volume ID: 21f480f7-fc5a-4fd8-a084-3964634a9332
Status: Started
Number of Bricks: 8
Transport-type: tcp,rdma
Bricks:
Brick1: bs2:/raid1
Brick2: bs2:/raid2
Brick3: bs3:/raid1
Brick4: bs3:/raid2
Brick5: bs4:/raid1
Brick6: bs4:/raid2
Brick7: bs1:/raid1
Brick8: bs1:/raid2
Options Reconfigured:
performance.write-behind-window-size: 1024MB
performance.flush-behind: on
performance.cache-size: 268435456
nfs.disable: on
performance.io-cache: on
performance.quick-read: on
performance.io-thread-count: 64
auth.allow: 10.2.*.*,10.1.*.*

'gluster volume status gl detail': http://pastie.org/8548826

---
Harry Mangalam - Research Computing, OIT, Rm 225 MSTB, UC Irvine [m/c 2225] / 92697
Google Voice Multiplexer: (949) 478-4487
415 South Circle View Dr, Irvine, CA, 92697 [shipping]
MSTB Lat/Long: (33.642025,-117.844414) (paste into Google Maps)
---
Re: [Gluster-users] Where does the 'date' string in '/var/log/glusterfs/gl.log' come from?
To confirm: Joe's explanation, uncomfortable as it is, looks to be the correct one. When the servers were powered off and restarted (so the gluster processes /had/ to be restarted), the new ones started up and used the correct time format. The 'problem' clients were the ones which were running the updated version; when all the clients were forced to restart the glusterfs, they all appear to be running with the UTC time (tho hard to tell, since the number of logged incidents has fallen dramatically).

There is a repeating entry in all the server logs tho:

1001 $ tail -3 /var/log/glusterfs/etc-glusterfs-glusterd.vol.log
[2013-12-11 18:26:41.824180] E [socket.c:2788:socket_connect] 0-management: connection attempt failed (Connection refused)
[2013-12-11 18:26:44.825089] E [socket.c:2788:socket_connect] 0-management: connection attempt failed (Connection refused)
[2013-12-11 18:26:47.825952] E [socket.c:2788:socket_connect] 0-management: connection attempt failed (Connection refused)

Is there a way to detect which client(s) this is coming from?

On Tuesday, December 10, 2013 11:23:11 AM Joe Julian wrote:
> If I were to hazard a guess, since the timestamp is not configurable and *is* UTC in 3.4, it would seem that any server that's logging in local time must not be running 3.4. Sure, it's installed, but the application hasn't been restarted since it was installed. That's the only thing I can think of that would allow that behavior.

---
Harry Mangalam - Research Computing, OIT, Rm 225 MSTB, UC Irvine [m/c 2225] / 92697
Google Voice Multiplexer: (949) 478-4487
415 South Circle View Dr, Irvine, CA, 92697 [shipping]
MSTB Lat/Long: (33.642025,-117.844414) (paste into Google Maps)
---
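One way to see what is on the other end of those refused connections - a sketch; note that socket_connect errors in glusterd's own log are glusterd dialing *out*, so strace may be more telling than a packet capture:

# Which address is glusterd trying (and failing) to connect to?
strace -f -e trace=connect -p $(pgrep -o glusterd) 2>&1 | grep ECONNREFUSED
# Or watch SYNs around the glusterd management port (24007) on the wire:
tcpdump -n -i any 'port 24007 and tcp[tcpflags] & tcp-syn != 0'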
Re: [Gluster-users] Where does the 'date' string in '/var/log/glusterfs/gl.log' come from?
On Tuesday, December 10, 2013 12:49:25 PM Sharuzzaman Ahmat Raslan wrote:
> Hi Harry,
> Did you setup ntp on each of the nodes, and sync the time to one single source?

Yes, this is done by ROCKS and all the nodes have identical time. (2 admins have checked repeatedly.)

> Thanks.
>
> On Tue, Dec 10, 2013 at 12:44 PM, harry mangalam <harry.manga...@uci.edu> wrote:
> > Admittedly I should search the source, but I wonder if anyone knows this offhand.
> >
> > Background: of our 84 ROCKS (6.1)-provisioned compute nodes, 4 have picked up an 'advanced date' in the /var/log/glusterfs/gl.log file - that date string is running about 5-6 hours ahead of the system date and all the Gluster servers (which are identical and correct). The time advancement does not appear to be identical, tho it's hard to tell, since it only shows on errors and those update irregularly. All the clients are the same version and all the servers are the same (gluster v 3.4.0-8.el6.x86_64).
> >
> > This would not be of interest except that those 4 clients are losing files, unable to reliably do IO, etc on the gluster fs. They don't appear to be having problems with NFS mounts, nor with a Fraunhofer FS that is also mounted on each node. Rebooting 2 of them has no effect - they come right back with an advanced date.

---
Harry Mangalam - Research Computing, OIT, Rm 225 MSTB, UC Irvine [m/c 2225] / 92697
Google Voice Multiplexer: (949) 478-4487
415 South Circle View Dr, Irvine, CA, 92697 [shipping]
MSTB Lat/Long: (33.642025,-117.844414) (paste into Google Maps)
---
Re: [Gluster-users] Where does the 'date' string in '/var/log/glusterfs/gl.log' come from?
On Tuesday, December 10, 2013 10:42:28 AM Vijay Bellur wrote:
> On 12/10/2013 10:14 AM, harry mangalam wrote:
> > Admittedly I should search the source, but I wonder if anyone knows this offhand.
> > Background: of our 84 ROCKS (6.1)-provisioned compute nodes, 4 have picked up an 'advanced date' in the /var/log/glusterfs/gl.log file - that date string is running about 5-6 hours ahead of the system date and all the Gluster servers (which are identical and correct). The time advancement does not appear to be identical, tho it's hard to tell, since it only shows on errors and those update irregularly.
>
> The timestamps in the log file are by default in UTC. That could possibly explain why the timestamps look advanced in the log file.

That seems to make sense. The advanced time on the 4 problem nodes looks to be the correct UTC time, but the others are using /local time/ in their logs, for some reason. And the localtime nodes are the ones NOT having problems. ...??!

However, this looks to be more of a ROCKS/config problem than a general gluster problem at this point. All the nodes have the md5-identical /etc/localtime, but they seem to be behaving differently as to the logging. Thanks for the pointer.

hjm

> > All the clients are the same version and all the servers are the same (gluster v 3.4.0-8.el6.x86_64).
> > This would not be of interest except that those 4 clients are losing files, unable to reliably do IO, etc on the gluster fs. They don't appear to be having problems with NFS mounts, nor with a Fraunhofer FS that is also mounted on each node.
>
> Do you observe anything in the client log files of these machines that indicates I/O problems?

Yes.

> Thanks,
> Vijay

---
Harry Mangalam - Research Computing, OIT, Rm 225 MSTB, UC Irvine [m/c 2225] / 92697
Google Voice Multiplexer: (949) 478-4487
415 South Circle View Dr, Irvine, CA, 92697 [shipping]
MSTB Lat/Long: (33.642025,-117.844414) (paste into Google Maps)
---
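A trivial way to confirm that the 'advanced' timestamps really are just UTC, and that the tz config is uniform - run on an affected and an unaffected node and compare:

date          # local time
date -u       # UTC, what gluster 3.4 logs in; the gap should match the log offset
md5sum /etc/localtime    # should be identical across nodes, as noted above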
[Gluster-users] Where does the 'date' string in '/var/log/glusterfs/gl.log' come from?
Admittedly I should search the source, but I wonder if anyone knows this offhand.

Background: of our 84 ROCKS (6.1)-provisioned compute nodes, 4 have picked up an 'advanced date' in the /var/log/glusterfs/gl.log file - that date string is running about 5-6 hours ahead of the system date and all the Gluster servers (which are identical and correct). The time advancement does not appear to be identical, tho it's hard to tell, since it only shows on errors and those update irregularly. All the clients are the same version and all the servers are the same (gluster v 3.4.0-8.el6.x86_64).

This would not be of interest except that those 4 clients are losing files, unable to reliably do IO, etc on the gluster fs. They don't appear to be having problems with NFS mounts, nor with a Fraunhofer FS that is also mounted on each node. Rebooting 2 of them has no effect - they come right back with an advanced date.

---
Harry Mangalam - Research Computing, OIT, Rm 225 MSTB, UC Irvine [m/c 2225] / 92697
Google Voice Multiplexer: (949) 478-4487
415 South Circle View Dr, Irvine, CA, 92697 [shipping]
MSTB Lat/Long: (33.642025,-117.844414) (paste into Google Maps)
---
Re: [Gluster-users] Slow metadata
This is a widely perceived feature/bug of gluster. It also affects other distributed filesystems, tho generally not as much. We've done 2 things to address this.

One is a distributed 'du' that is clusterfork'ed out to the storage nodes and compiles the results. This is realtime and will provide data to that point. If you're interested in it, let me know and I can provide the code to do this. However, it requires clusterfork, some per-site config, and is specific to 'du', altho it could be modified to support other shell commands. Here's the difference in performance on a fairly busy gluster system (4 storage nodes, 8 volumes, 340TB, 60% used):

=
14:54:09 root@hpc-s:/som
1226 $ time du -sh abusch/*
694M    abusch/MATS-gtf
^C
real    3m58.098s   --- killed after ~4m
user    0m0.033s
sys     0m0.351s

14:58:24 root@hpc-s:/som
1227 $ gfdu abusch/\*
INFO: Corrected gluster starting path: [/som/abusch/*]
About to execute [/root/bin/cf --script --tar=GLSRV du -s /raid1/som/abusch/* ; du -s /raid2/som/abusch/*; ]
Go? [yN] y
INFO: For raw results [cd /root/cf/CF-du--s--raid1-som-abu-14.58.38_2013-10-08]
Size:        File|Dir
693.8203 M   /som/abusch/MATS-gtf
1.5292 G     /som/abusch/MISO-gffs
764.5117 M   /som/abusch/MISO-gffs-v2
23.8720 G    /som/abusch/deepSeq
25.2845 G    /som/abusch/genomes
5.4239 G     /som/abusch/index
16.8011 G    /som/abusch/index2
---
74.3348 G    Total

time was ~4s
=

The other approach is the RobinHood Policy Engine (http://sourceforge.net/apps/trac/robinhood), which runs on a cron and recurses thru your FS, taking X hours, but compiles that info into a MySQL DB that is instantly responsive (but could be slightly out of date). NTL, it's a very helpful tool to detect hotspots and ZOTfiles (Zillions Of Tiny files). We are using it to monitor NFS volumes, Gluster, and Fraunhofer FSs. It is a very slick system, and a student (Adam Brenner) is modifying it to generate better stats via the web interface. See his github and the robinhood trac:

https://github.com/abrenner/robinhood-multifs-web
http://sourceforge.net/apps/trac/robinhood

On Tuesday, October 08, 2013 09:07:52 AM Anders Salling Andersen wrote:
> Hi all, I have a 50tb glusterfs replicated setup, with many small files. My metadata is very slow; ex. 'du -sh' takes over 24 hours. Is there a way to make metadata faster?
> Regards, Anders.

---
Harry Mangalam - Research Computing, OIT, Rm 225 MSTB, UC Irvine [m/c 2225] / 92697
Google Voice Multiplexer: (949) 478-4487
415 South Circle View Dr, Irvine, CA, 92697 [shipping]
MSTB Lat/Long: (33.642025,-117.844414) (paste into Google Maps)
---
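The same per-brick trick works without clusterfork - a rough sketch, assuming passwordless root ssh to the 4 storage nodes and the /raid1 and /raid2 brick layout from these threads; on a distribute-only volume the per-brick sizes simply sum:

for h in bs1 bs2 bs3 bs4; do
    # du each brick's copy of the tree directly, skipping missing paths
    ssh $h 'du -s /raid1/som/abusch/* /raid2/som/abusch/* 2>/dev/null'
done | awk '{p=$2; sub(/^\/raid[12]/,"",p); kb[p]+=$1}
            END {for (d in kb) printf "%10.4f G  %s\n", kb[d]/1048576, d}'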
Re: [Gluster-users] working infiniband HW ??
I concur - I used the same cards on our test cluster - just make sure you upgrade the firmware to the latest revision:
http://www.mellanox.com/content/pages.php?pg=custom_firmware_table

The fw upgrade is not trivial - well, it IS trivial, but bringing all the info together was not - if you want some rough notes, let me know.

the other harry

On Mon, Jul 8, 2013 at 5:23 AM, Justin Clift <jcl...@redhat.com> wrote:
> On 08/07/2013, at 10:20 AM, HL wrote:
> > I am currently testing glusterfs on a small-scale non-production env with ordinary nics. I would like to purchase a couple of infiniband nics in order to connect 3 servers in point-to-point mode, that is, with no switches in between. Since I've noticed that some of you have this kind of H/W, any info on brands/models and a good known-to-work setup will be highly appreciated.
>
> Depends on the kind of performance you're after. :) If you're just wanting something better than 10GbE at minimal cost, these work fine under Linux:
>
> http://www.ebay.co.uk/itm/360657396651  (~US$40 each, not counting postage)
>
> They're rebadged Mellanox MHGH28-XTC cards. Completely ok to be flashed with standard Mellanox firmware. For cables, I use these:
>
> http://www.ebay.co.uk/itm/251200441924  (US$14 each, not counting postage)
>
> Note, I'm comfortable getting stuff off eBay where the seller seems ok. So far, so good. ;)
>
> If you do decide to use the above stuff, and later on want a switch, try and find a Voltaire ISR 9024D-M. They're DDR infiniband (20Gb/s per port) and can run fanless (completely silent).
>
> http://www.ebay.co.uk/itm/350827346490
>
> They can be had pretty cheaply if you're willing to wait for good pricing. I got mine for ~US$270.
>
> Hope that helps. :)
>
> Regards and best wishes,
> Justin Clift
>
> --
> Open Source and Standards @ Red Hat
> twitter.com/realjustinclift

Regards,
Harry

--
Data backup by the NSA - your tax dollars at work.

--
Harry Mangalam - Research Computing, OIT, Rm 225 MSTB, UC Irvine [m/c 2225] / 92697
Google Voice Multiplexer: (949) 478-4487
415 South Circle View Dr, Irvine, CA, 92697 [shipping]
MSTB Lat/Long: (33.642025,-117.844414) (paste into Google Maps)
[Gluster-users] Glusterfs 3.3 rapidly generating write errors under heavy load.
Online           : Y
Pid              : 2961
File System      : xfs
Device           : /dev/sdd
Mount Options    : rw,noatime,sunit=512,swidth=7680,allocsize=32m
Inode Size       : 256
Disk Space Free  : 24.1TB
Total Disk Space : 40.9TB
Inode Count      : 8789028864
Free Inodes      : 8786101010
--
Brick: Brick bs1:/raid1
Port             : 24013
Online           : Y
Pid              : 3043
File System      : xfs
Device           : /dev/sdc
Mount Options    : rw,noatime,sunit=512,swidth=8192,allocsize=32m
Inode Size       : 256
Disk Space Free  : 29.1TB
Total Disk Space : 43.7TB
Inode Count      : 9374964096
Free Inodes      : 9372036362
--
Brick: Brick bs1:/raid2
Port             : 24015
Online           : Y
Pid              : 3049
File System      : xfs
Device           : /dev/sdd
Mount Options    : rw,noatime,sunit=512,swidth=7680,allocsize=32m
Inode Size       : 256
Disk Space Free  : 25.9TB
Total Disk Space : 40.9TB
Inode Count      : 8789028864
Free Inodes      : 8786101382

---
Harry Mangalam - Research Computing, OIT, Rm 225 MSTB, UC Irvine [m/c 2225] / 92697
Google Voice Multiplexer: (949) 478-4487
415 South Circle View Dr, Irvine, CA, 92697 [shipping]
MSTB Lat/Long: (33.642025,-117.844414) (paste into Google Maps)
---
A Message From a Dying Veteran http://goo.gl/tTHdo
[Gluster-users] gluster 3.3 sporadic 'Permission denied' failure under load in cluster env
Sending this again, since I'm not even sure that the 1st made it to the list, and it's just happened again, even with the same user (one of the heaviest users, but I don't think there's anything odd about his usage). In the last 3 days, we've had 6 such errors, resulting in the logged error:

E [posix.c:1730:posix_create] 0-gl-posix: setting gfid on [file] failed

A question that could be answered is: has anyone had such errors show up in their brick logs? ie:

grep -n posix.c:1730:posix_create /var/log/glusterfs/bricks/raid[12].log

hjm

=== previously ===

We have a ~2500-core academic cluster with saturating amounts of use. The main data store is running on a 4-node/8-brick/340TB/QDR IB gluster 3.3 filesystem. All are 8xOpteron/32GB systems with 3ware 9750 SAS controllers. The servers all run SL6.2 and are stable, with load steady at about 2 continuously. gluster is config'ed as:

Volume Name: gl
Type: Distribute
Volume ID: 21f480f7-fc5a-4fd8-a084-3964634a9332
Status: Started
Number of Bricks: 8
Transport-type: tcp,rdma
Bricks:
Brick1: bs2:/raid1
Brick2: bs2:/raid2
Brick3: bs3:/raid1
Brick4: bs3:/raid2
Brick5: bs4:/raid1
Brick6: bs4:/raid2
Brick7: bs1:/raid1
Brick8: bs1:/raid2
Options Reconfigured:
performance.write-behind-window-size: 1024MB
performance.flush-behind: on
performance.cache-size: 268435456
nfs.disable: on
performance.io-cache: on
performance.quick-read: on
performance.io-thread-count: 64
auth.allow: 10.2.*.*,10.1.*.*

Many of our users run large array jobs under SGE, and especially during those runs where there is LOTS of IO, we will VERY occasionally (20 times since last June, according to brick logs) see these kinds of errors, resulting in the failure of that particular element of the array job. Sometimes these failures are acceptable, but often the next job depends on all elements of the array job completing correctly. At any rate, from the fs POV they should all complete. The rarity of this error, the type of error, and where it is located suggest that it might be a hash collision..? According to gluster bugzilla this doesn't seem to be a registered bug, so here I am asking if this has been seen by others and how it might be addressed.
=
The error below, being reported by Grid Engine, says:

user root 03/21/2013 15:29:23 [507:26777]: error: can't open output file /gl/bio/krthornt/WTCCC/autosomal_analysis_Jan2013/1958BC/COMPUTE_1958BC.o254058.103: Permission denied
03/21/2013 15:29:23 [400:25458]: wait3
=

Looking thru all the server logs (/var/log/glusterfs/etc-glusterfs-glusterd.vol.log) reveals nothing about this error, but the brick logs yield this set of lines referencing that file at the correct time:

/var/log/glusterfs/bricks/raid1.log:[2013-03-21 15:43:18.667171] W [posix-handle.c:461:posix_handle_hard] 0-gl-posix: link /raid1/bio/krthornt/WTCCC/autosomal_analysis_Jan2013/1958BC/COMPUTE_1958BC.o254058.103 -> /raid1/.glusterfs/5a/0e/5a0e87a6-e35d-4368-841e-b45802fecc4e failed (File exists)
/var/log/glusterfs/bricks/raid1.log:[2013-03-21 15:43:18.667249] E [posix.c:1730:posix_create] 0-gl-posix: setting gfid on /raid1/bio/krthornt/WTCCC/autosomal_analysis_Jan2013/1958BC/COMPUTE_1958BC.o254058.103 failed
/var/log/glusterfs/bricks/raid1.log:[2013-03-21 15:43:19.241602] I [server3_1-fops.c:1538:server_open_cbk] 0-gl-server: 644765: OPEN /bio/krthornt/WTCCC/autosomal_analysis_Jan2013/1958BC/COMPUTE_1958BC.o254058.103 (5a0e87a6-e35d-4368-841e-b45802fecc4e) ==> -1 (Permission denied)
/var/log/glusterfs/bricks/raid1.log:[2013-03-21 15:43:19.520455] I [server3_1-fops.c:1538:server_open_cbk] 0-gl-server: 644970: OPEN /bio/krthornt/WTCCC/autosomal_analysis_Jan2013/1958BC/COMPUTE_1958BC.o254058.103 (5a0e87a6-e35d-4368-841e-b45802fecc4e) ==> -1 (Permission denied)

---
Harry Mangalam - Research Computing, OIT, Rm 225 MSTB, UC Irvine [m/c 2225] / 92697
Google Voice Multiplexer: (949) 478-4487
415 South Circle View Dr, Irvine, CA, 92697 [shipping]
MSTB Lat/Long: (33.642025,-117.844414) (paste into Google Maps)
---
[Gluster-users] gluster 3.3 sporadic 'Permission denied' failure under load in cluster env
We have a ~2500-core academic cluster with saturating amounts of use. The main data store is running on a 4-node/8-brick/340TB/QDR IB gluster 3.3 filesystem. All are 8xOpteron/32GB systems with 3ware 9750 SAS controllers. The servers all run SL6.2 and are stable, with load steady at about 2 continuously. gluster is config'ed as:

Volume Name: gl
Type: Distribute
Volume ID: 21f480f7-fc5a-4fd8-a084-3964634a9332
Status: Started
Number of Bricks: 8
Transport-type: tcp,rdma
Bricks:
Brick1: bs2:/raid1
Brick2: bs2:/raid2
Brick3: bs3:/raid1
Brick4: bs3:/raid2
Brick5: bs4:/raid1
Brick6: bs4:/raid2
Brick7: bs1:/raid1
Brick8: bs1:/raid2
Options Reconfigured:
performance.write-behind-window-size: 1024MB
performance.flush-behind: on
performance.cache-size: 268435456
nfs.disable: on
performance.io-cache: on
performance.quick-read: on
performance.io-thread-count: 64
auth.allow: 10.2.*.*,10.1.*.*

Many of our users run large array jobs under SGE, and especially during those runs where there is LOTS of IO, we will VERY occasionally (20 times since last June, according to brick logs) see these kinds of errors, resulting in the failure of that particular element of the array job. Sometimes these failures are acceptable, but often the next job depends on all elements of the array job completing correctly. At any rate, from the fs POV they should all complete. The rarity of this error, the type of error, and where it is located suggest that it might be a hash collision..? According to gluster bugzilla this doesn't seem to be a registered bug, so here I am asking if this has been seen by others and how it might be addressed.

=
The error below, being reported by Grid Engine, says:

user root 03/21/2013 15:29:23 [507:26777]: error: can't open output file /gl/bio/krthornt/WTCCC/autosomal_analysis_Jan2013/1958BC/COMPUTE_1958BC.o254058.103: Permission denied
03/21/2013 15:29:23 [400:25458]: wait3
=

Looking thru all the server logs (/var/log/glusterfs/etc-glusterfs-glusterd.vol.log) reveals nothing about this error, but the brick logs yield this set of lines referencing that file at the correct time:

/var/log/glusterfs/bricks/raid1.log:[2013-03-21 15:43:18.667171] W [posix-handle.c:461:posix_handle_hard] 0-gl-posix: link /raid1/bio/krthornt/WTCCC/autosomal_analysis_Jan2013/1958BC/COMPUTE_1958BC.o254058.103 -> /raid1/.glusterfs/5a/0e/5a0e87a6-e35d-4368-841e-b45802fecc4e failed (File exists)
/var/log/glusterfs/bricks/raid1.log:[2013-03-21 15:43:18.667249] E [posix.c:1730:posix_create] 0-gl-posix: setting gfid on /raid1/bio/krthornt/WTCCC/autosomal_analysis_Jan2013/1958BC/COMPUTE_1958BC.o254058.103 failed
/var/log/glusterfs/bricks/raid1.log:[2013-03-21 15:43:19.241602] I [server3_1-fops.c:1538:server_open_cbk] 0-gl-server: 644765: OPEN /bio/krthornt/WTCCC/autosomal_analysis_Jan2013/1958BC/COMPUTE_1958BC.o254058.103 (5a0e87a6-e35d-4368-841e-b45802fecc4e) ==> -1 (Permission denied)
/var/log/glusterfs/bricks/raid1.log:[2013-03-21 15:43:19.520455] I [server3_1-fops.c:1538:server_open_cbk] 0-gl-server: 644970: OPEN /bio/krthornt/WTCCC/autosomal_analysis_Jan2013/1958BC/COMPUTE_1958BC.o254058.103 (5a0e87a6-e35d-4368-841e-b45802fecc4e) ==> -1 (Permission denied)

---
Harry Mangalam - Research Computing, OIT, Rm 225 MSTB, UC Irvine [m/c 2225] / 92697
Google Voice Multiplexer: (949) 478-4487
415 South Circle View Dr, Irvine, CA, 92697 [shipping]
MSTB Lat/Long: (33.642025,-117.844414) (paste into Google Maps)
---
Re: [Gluster-users] Slow read performance
Have you run oprofile on the client and server simultaneously to see if there's some race condition developing? Obviously the NFS client is fine, so it's clear that there's nothing wrong with the hardware. oprofile will at least reveal where the bits are vacationing and may point to a specific bottleneck. See oprofile.sf.net for docs and examples (pretty good); it's fairly easy to set up to profile applications, a bit more trouble if you're trying to profile kernel interactions, but it looks like you might not have to.

I wouldn't want to forklift 160TB either. My sympathies.

hjm

On Thursday, March 07, 2013 09:27:42 PM Thomas Wakefield wrote:
> Inode size is 256. Pretty stuck with these settings and ext4. I missed the memo that Gluster started to prefer xfs; back in the 2.x days xfs was not the preferred filesystem.
> At this point it's a 340TB filesystem with 160TB used. I just added more space, and was doing some followup testing and wasn't impressed with the results. But I am sure I was happier before with the performance. Still running CentOS 5.8.
> Anything else I could look at?
> Thanks, Tom
...

---
Harry Mangalam - Research Computing, OIT, Rm 225 MSTB, UC Irvine [m/c 2225] / 92697
Google Voice Multiplexer: (949) 478-4487
415 South Circle View Dr, Irvine, CA, 92697 [shipping]
MSTB Lat/Long: (33.642025,-117.844414) (paste into Google Maps)
---
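For the archives, the classic workflow looks roughly like this - a sketch for the legacy (pre-operf) oprofile shipped with CentOS 5/6; the profiled binary path is illustrative:

opcontrol --init                  # load the oprofile kernel module
opcontrol --no-vmlinux --start    # userspace-only profiling is enough here
# ... reproduce the slow reads on the client/server ...
opcontrol --dump && opcontrol --shutdown
opreport -l /usr/sbin/glusterfsd | head -30   # hottest glusterfsd symbols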
Re: [Gluster-users] GlusterFS performance
This kind of info is surprisingly hard to obtain. The gluster docs do contain some of it, ie:
http://community.gluster.org/a/linux-kernel-tuning-for-glusterfs/

I also found well-described kernel tuning parameters in the FHGFS wiki (as another distributed fs, they share some characteristics):
http://www.fhgfs.com/wiki/wikka.php?wakka=StorageServerTuning

and more XFS tuning filesystem params here:
http://www.mythtv.org/wiki/Optimizing_Performance#Further_Information
and here:
http://www.mysqlperformanceblog.com/2011/12/16/setting-up-xfs-the-simple-edition

But of course, YMMV, and a number of these parameters conflict and/or have serious tradeoffs, as you discovered.

LSI recently loaned me a Nytro SAS controller (on-card SSD-cached) which seems pretty phenomenal on a single brick (and is predicted to perform well based on their profiling), but I am waiting for another node to arrive before I can test it under true gluster conditions. Anyone else tried this hardware?

hjm

On Tuesday, March 05, 2013 12:34:41 PM Nikita A Kardashin wrote:
> Hello all!
> This problem was solved by me today. The root of it all is an incompatibility between the gluster cache and the kvm cache. The bug reproduces if a KVM virtual machine is created with cache=writethrough (default for OpenStack) and hosted on a GlusterFS volume. If any other cache mode (cache=writeback, or cache=none with direct-io) is used, the performance of writing to an existing file inside the VM is equal to bare-storage (from the host machine) write performance.
> I think this must be documented in Gluster, and maybe a bug filed.
> Other question: where can I read something about gluster tuning (optimal cache size, write-behind, flush-behind use cases and other)? I found only an options list, without any how-to or tested cases.
>
> 2013/3/5 Toby Corkindale <toby.corkind...@strategicdata.com.au>
> > On 01/03/13 21:12, Brian Candler wrote:
> > > On Fri, Mar 01, 2013 at 03:30:07PM +0600, Nikita A Kardashin wrote:
> > > > If I try to execute the above command inside a virtual machine (KVM), the first time all goes right - about 900MB/s (cache effect, I think), but if I run this test again on an existing file, the task (dd) hangs and can be stopped only by Ctrl+C. Overall virtual system latency is poor too. For example, apt-get upgrade upgrades the system very, very slowly, freezing on "Unpacking replacement" and other io-related steps. Does glusterfs have any tuning options that can help me?
> > >
> > > If you are finding that processes hang or freeze indefinitely, this is not a question of tuning; this is simply broken. Anyway, you're asking the wrong person - I'm currently in the process of stripping out glusterfs, although I remain interested in the project. I did find that KVM performed very poorly, but KVM was not my main application and that's not why I'm abandoning it. I'm stripping out glusterfs primarily because it's not supportable in my environment, because there is no documentation on how to analyse and recover from failure scenarios which can and do happen. This point in more detail:
> > > http://www.gluster.org/pipermail/gluster-users/2013-January/035118.html
> > >
> > > The other downside of gluster was its lack of flexibility, in particular the fact that there is no usage scaling factor on bricks, so that even with a simple distributed setup all your bricks have to be the same size. Also, the object store feature which I wanted to use has clearly had hardly any testing (even the RPM packages don't install properly).
> > > I *really* wanted to deploy gluster, because in principle I like the idea of a virtual distribution/replication system which sits on top of existing local filesystems. But for storage, I need something where operational supportability is at the top of the pile.
> >
> > I have to agree; GlusterFS has been in use here in production for a while, and while it mostly works, it's been fragile and documentation has been disappointing. Despite 3.3 being in beta for a year, it still seems to have been poorly tested. For eg, I can't believe almost no-one else noticed that the log files were busted... nor that the bug report has been around for a quarter of a year without being responded to or fixed.
> >
> > I have to ask -- what are you moving to now, Brian?
> > -Toby

---
Harry Mangalam - Research Computing, OIT, Rm 225 MSTB, UC Irvine [m/c 2225] / 92697
Google Voice Multiplexer: (949) 478-4487
415 South Circle View Dr, Irvine, CA, 92697 [shipping]
MSTB Lat/Long: (33.642025,-117.844414) (paste into Google Maps)
---
Something must be done. [X] is something. Therefore, we must do it.
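For reference, the cache mode Nikita describes is set per-disk at VM launch - a hedged sketch with a raw qemu-kvm invocation (the image path is illustrative; OpenStack/libvirt set the same knob via their disk config):

# writeback avoids the hang described above; writethrough was the problem case:
qemu-kvm -m 1024 -nographic \
    -drive file=/gl/vm/test.qcow2,if=virtio,cache=writeback
# cache=none needs direct-I/O support from the underlying FS, per the post above:
#   -drive file=/gl/vm/test.qcow2,if=virtio,cache=none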
Re: [Gluster-users] Peer Probe
It /might be/probably is/ DNS-related. Are you trying to do this with RDMA or IPoIB? If IPoIB, are ALL your /etc/hosts files in sync (IB names separate and distinct from the ethernet interfaces) and responsive on the appropriate interfaces? Do the IB interfaces show up as distinct (and connected) in an 'ifconfig -a' and 'ibstat' dump? Do all the peers show up in an 'ibhosts' query?

What is the output of:

gluster volume status <your_volume>
and
gluster volume status <your_volume> detail

hjm

On Monday, February 25, 2013 07:46:00 PM Tony Saenz wrote:
> It shows this, but it's still going through my NIC cards and not the Infiniband. (Checked the traffic on the cards themselves.)
>
> [root@fpsgluster ~]# gluster peer status
> Number of Peers: 1
>
> Hostname: fpsgluster2
> Uuid: 9b7e7c2d-f05b-4cc8-b55a-571e383328d0
> State: Peer in Cluster (Connected)
>
> On Feb 25, 2013, at 10:51 AM, Torbjørn Thorsen <torbj...@trollweb.no> wrote:
> > Your error message seems to indicate that the peer is already in the storage pool? What is the output of "gluster peer status"?
> >
> > On Mon, Feb 25, 2013 at 7:28 PM, Tony Saenz <t...@filmsolutions.com> wrote:
> > > Any help please? The regular NICs are fine, which is what it currently sees, but I'd like to move them over to the Infiniband cards.
> > >
> > > On Feb 22, 2013, at 1:50 PM, Anthony Saenz <t...@filmsolutions.com> wrote:
> > > > Hey, I was wondering if I could get a bit of help... I installed a new Infiniband card into my servers, but I'm unable to get it to come up as a peer. Is there something I'm missing?
> > > >
> > > > [root@fpsgluster testvault]# gluster peer probe fpsgluster2ib
> > > > Probe on host fpsgluster2ib port 0 already in peer list
> > > >
> > > > [root@fpsgluster testvault]# yum list installed | grep gluster
> > > > glusterfs.x86_64 3.3.1-1.el6 installed
> > > > glusterfs-devel.x86_64 3.3.1-1.el6 installed
> > > > glusterfs-fuse.x86_64 3.3.1-1.el6 installed
> > > > glusterfs-geo-replication.x86_64 3.3.1-1.el6 installed
> > > > glusterfs-rdma.x86_64 3.3.1-1.el6 installed
> > > > glusterfs-server.x86_64 3.3.1-1.el6 installed
> > > > Thanks.
> >
> > --
> > Best regards,
> > Torbjørn Thorsen
> > Developer / operations engineer
> > Trollweb Solutions AS - Professional Magento Partner
> > www.trollweb.no
> > Daytime phone: +47 51215300
> > Evening/weekend phone: for customers with a service agreement
> > Visiting address: Luramyrveien 40, 4313 Sandnes
> > Postal address: Maurholen 57, 4316 Sandnes
> > Note that all our standard terms always apply

---
Harry Mangalam - Research Computing, OIT, Rm 225 MSTB, UC Irvine [m/c 2225] / 92697
Google Voice Multiplexer: (949) 478-4487
415 South Circle View Dr, Irvine, CA, 92697 [shipping]
MSTB Lat/Long: (33.642025,-117.844414) (paste into Google Maps)
---
Something must be done. [X] is something. Therefore, we must do it.
Bruce Schneier, on American response to just about anything.
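If the pool really was first built by probing the ethernet name, one hedged way to move the peering over to the IB name - detach removes the peer from the pool, so do this only with volumes stopped and after checking the 3.3 docs:

gluster peer detach fpsgluster2     # drop the eth-named peer
gluster peer probe fpsgluster2ib    # re-probe via the IPoIB hostname
gluster peer status                 # should now list the ib name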
Re: [Gluster-users] Peer Probe
That looks OK (but your 2 MTUs are mismatched - you should fix that):

UP BROADCAST RUNNING MULTICAST MTU:2044 Metric:1     <- 1st
UP BROADCAST RUNNING MULTICAST MTU:65520 Metric:1    <- 2nd

IBDEV=ibX
modprobe ib_umad
modprobe ib_ipoib
echo connected > /sys/class/net/${IBDEV}/mode
echo 65520 > /sys/class/net/${IBDEV}/mtu

How did you set up the peering? By name? By IP#? (I assume pinging by hostname also works both ways?) If you can't get the peers to ack, then what do the logs say on failure to:

gluster peer probe <host>

or to create the volume:

gluster volume create <volname> host1ib:/gl_part host2ib:/gl_part

hjm

On Monday, February 25, 2013 09:50:02 PM Tony Saenz wrote:
> Trying to first get this working with IPoIB.
>
> [root@fpsgluster ~]# ibhosts
> Ca : 0x0011757937b2 ports 1 "fpsgluster2 qib0"
> Ca : 0x001175792af2 ports 1 "fpsgluster qib0"
>
> I'm able to ping the other box from Infiniband to Infiniband card.
>
> Ifconfig uses the ioctl access method to get the full address information, which limits hardware addresses to 8 bytes. Because an Infiniband address has 20 bytes, only the first 8 bytes are displayed correctly. Ifconfig is obsolete! For replacement check ip.
> ib0  Link encap:InfiniBand HWaddr 80:00:00:03:FE:80:00:00:00:00:00:00:00:00:00:00:00:00:00:00
>      inet addr:10.0.4.35 Bcast:10.0.4.255 Mask:255.255.255.0
>      inet6 addr: fe80::211:7500:79:2af2/64 Scope:Link
>      UP BROADCAST RUNNING MULTICAST MTU:2044 Metric:1
>      RX packets:1567 errors:0 dropped:0 overruns:0 frame:0
>      TX packets:587 errors:0 dropped:24 overruns:0 carrier:0
>      collisions:0 txqueuelen:256
>      RX bytes:342622 (334.5 KiB) TX bytes:96554 (94.2 KiB)
>
> [root@fpsgluster2 ~]# ifconfig ib0
> Ifconfig uses the ioctl access method to get the full address information, which limits hardware addresses to 8 bytes. Because an Infiniband address has 20 bytes, only the first 8 bytes are displayed correctly. Ifconfig is obsolete! For replacement check ip.
> ib0  Link encap:InfiniBand HWaddr 80:00:00:03:FE:80:00:00:00:00:00:00:00:00:00:00:00:00:00:00
>      inet addr:10.0.4.34 Bcast:10.0.4.255 Mask:255.255.255.0
>      inet6 addr: fe80::211:7500:79:37b2/64 Scope:Link
>      UP BROADCAST RUNNING MULTICAST MTU:65520 Metric:1
>      RX packets:599 errors:0 dropped:0 overruns:0 frame:0
>      TX packets:1558 errors:0 dropped:8 overruns:0 carrier:0
>      collisions:0 txqueuelen:256
>      RX bytes:95180 (92.9 KiB) TX bytes:346728 (338.6 KiB)
>
> [root@fpsgluster ~]# ping -I ib0 10.0.4.34
> PING 10.0.4.34 (10.0.4.34) from 10.0.4.35 ib0: 56(84) bytes of data.
> 64 bytes from 10.0.4.34: icmp_seq=1 ttl=64 time=12.6 ms
> 64 bytes from 10.0.4.34: icmp_seq=2 ttl=64 time=0.184 ms
>
> /etc/hosts looks correct:
>
> [root@fpsgluster2 ~]# cat /etc/hosts | grep ib
> 10.0.4.35 fpsglusterib
> 10.0.4.34 fpsgluster2ib
> [root@fpsgluster ~]# cat /etc/hosts | grep ib
> 10.0.4.35 fpsglusterib
> 10.0.4.34 fpsgluster2ib
>
> I haven't created the new volume yet, as I can't get the peer probe to work off the Infiniband card. It's only seeing the NIC cards I currently have it hooked in to.
>
> On Feb 25, 2013, at 11:57 AM, harry mangalam <harry.manga...@uci.edu> wrote:
> > It /might be/probably is/ DNS-related. Are you trying to do this with RDMA or IPoIB? If IPoIB, are ALL your /etc/hosts files in sync (IB names separate and distinct from the ethernet interfaces) and responsive on the appropriate interfaces? Do the IB interfaces show up as distinct (and connected) in an 'ifconfig -a' and 'ibstat' dump? Do all the peers show up in an 'ibhosts' query?
Re: [Gluster-users] high CPU load on all bricks
but nothing solid. If the consensus is that NFS will not gain anything, then I won't waste the time setting it all up.

NFS gains you the use of FSCache to cache directories and file stats, making directory listings faster, but it adds overhead, decreasing the overall throughput (from all the reports I've seen). I would suspect that you have the kernel nfs server running on your servers. Make sure it's disabled.

Thanks, ~Mike C.

From: gluster-users-boun...@gluster.org [mailto:gluster-users-boun...@gluster.org] On Behalf Of Michael Colonno
Sent: Friday, February 01, 2013 4:46 PM
To: gluster-users@gluster.org
Subject: Re: [Gluster-users] high CPU load on all bricks

Update: after a few hours the CPU usage seems to have dropped down to a small value. I did not change anything with respect to the configuration or unmount / stop anything, as I wanted to see if this would persist for a long period of time. Both the client and the self-mounted bricks are now showing CPU < 1% (as reported by top). Prior to the larger CPU loads I installed a bunch of software into the volume (~5 GB total). Is this kind of transient behavior - by which I mean larger CPU loads after a lot of filesystem activity in a short time - typical? This is not a problem in my deployment; I just want to know what to expect in the future and to complete this thread for future users. If this is expected behavior we can wrap up this thread. If not, then I'll do more digging into the logs on the client and brick sides.

Thanks, ~Mike C.

From: Joe Julian [mailto:j...@julianfamily.org]
Sent: Friday, February 01, 2013 2:08 PM
To: Michael Colonno; gluster-users@gluster.org
Subject: Re: [Gluster-users] high CPU load on all bricks

Check the client log(s).

Michael Colonno mcolo...@stanford.edu wrote:
Forgot to mention: on a client system (not a brick) the glusterfs process is consuming ~68% CPU continuously. This is a much less powerful desktop system, so the CPU load can't be compared 1:1 with the systems comprising the bricks, but it is still very high. So the issue seems to exist with both glusterfsd and glusterfs processes.

Thanks, ~Mike C.

From: gluster-users-boun...@gluster.org [mailto:gluster-users-boun...@gluster.org] On Behalf Of Michael Colonno
Sent: Friday, February 01, 2013 12:46 PM
To: gluster-users@gluster.org
Subject: [Gluster-users] high CPU load on all bricks

Gluster gurus ~ I've deployed an 8-brick (2x replicate) Gluster 3.3.1 volume on CentOS 6.3 with tcp transport. I was able to build, start, mount, and use the volume. On each system contributing a brick, however, my CPU usage (glusterfsd) is hovering around 20% (virtually zero memory usage, thankfully). These are brand new, fairly beefy servers, so 20% CPU load is quite a bit. The deployment is pretty plain, with each brick mounting the volume to itself via a glusterfs mount. I assume this type of CPU usage is atypically high; is there anything I can do to investigate what's soaking up CPU and minimize it? Total usable volume size is only about 22 TB (about 45 TB total with 2x replicate).

Thanks, ~Mike C.
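On the kernel NFS point above: a quick way to confirm and disable knfsd on a CentOS 6-era server (assuming the stock init scripts) is something like:

# is the kernel NFS server registered with the portmapper?
rpcinfo -p | grep -w nfs

# stop it now and keep it off across reboots
service nfs stop
chkconfig nfs off

Gluster's built-in NFS server registers the same RPC service, so the two will conflict if knfsd is left running.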
Re: [Gluster-users] NFS availability
On Thursday, January 31, 2013 11:28:04 AM glusterzhxue wrote:
Hi all, As is known to us all, gluster provides NFS mount. However, if the mount point fails, clients will lose connection to Gluster, while if we use the gluster native client, this failure will have no effect on clients. For example:

mount -t glusterfs host1:/vol1 /mnt

If host1 goes down for some reason, the client still works; it is unaware of the failure (assuming we have multiple gluster servers).

The client will still fail (in most cases) since host1 (if I follow you) is part of the gluster groupset. Certainly if it's distributed-only; maybe not if it's a dist/repl gluster. But if host1 goes down, the client will not be able to find a gluster vol to mount.

However, if we use the following:

mount -t nfs -o vers=3 host1:/vol1 /mnt

If host1 fails, the client will lose its connection to the gluster servers.

If the client was mounting the glusterfs via a re-export from an intermediate host, you might be able to fail over to another intermediate NFS server, but if it was a gluster host, it would fail due to the reasons above.

Now, we want to use the NFS way. Could anyone give us some suggestions to solve the issue?

Multiple intermediate NFS servers with round-robin addressing? Anyone tried this?

Thanks Zhenghua

--- Harry Mangalam - Research Computing, OIT, Rm 225 MSTB, UC Irvine [m/c 2225] / 92697 Google Voice Multiplexer: (949) 478-4487 415 South Circle View Dr, Irvine, CA, 92697 [shipping] MSTB Lat/Long: (33.642025,-117.844414) (paste into Google Maps) --- Something must be done. [X] is something. Therefore, we must do it. Bruce Schneier, on American response to just about anything.

___ Gluster-users mailing list Gluster-users@gluster.org http://supercolony.gluster.org/mailman/listinfo/gluster-users
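To make those suggestions concrete - both hedged, neither tested here: the native client in recent 3.x releases accepts a backup volfile server, which only protects the mount operation itself (after mounting, the client talks to all bricks directly); for NFS, a round-robin DNS name spreads client mounts across servers but will not fail over an established mount - for that you would need a floating IP managed by something like CTDB or ucarp in front of the gluster NFS servers.

# native client: fall back to host2 if host1 is down at mount time
mount -t glusterfs -o backupvolfile-server=host2 host1:/vol1 /mnt

# NFS: 'glusternfs' is a hypothetical round-robin A record over all servers
mount -t nfs -o vers=3 glusternfs:/vol1 /mnt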
Re: [Gluster-users] Question regarding write performance issues
Just a guess, but how are the writes being done? If they're being written in zillions of tiny writes, then what you may be seeing is described here: http://moo.nac.uci.edu/~hjm/bduc/BDUC_USER_HOWTO.html#writeperfongl and the following stanza on named pipes. This is often the case with the large files being used in NGS/HTS where the fasta/fastq files are composed of millions of short (60-100 char) lines of characters and are typically written line-by-line. hjm On Wednesday, January 16, 2013 02:47:37 PM Ayelet Shemesh wrote: Hi to all Gluster experts, I have a cluster of 10 machines exposing a volume into which 12 other machines do many writes of large files (~100-300MB each). In general I'm very happy with gluster. It's a great solution, and is quite stable (thanks for the great work!). However, I have a problem which I was unable to solve yet, nor find any solution to in the documentation or on this list archive. When the client machines write locally, and then just copy the files they created to the gluster mount - everything works great. When the client machines write directly to the gluster mounted volume I get a huge performance hit. In one specific test case the difference was 20 minutes for the copy and 8 hours for the direct write. I tried to set the iocache attributes of write-behind-window and flush-behind, but to no avail. I will very much appreciate your help in solving this problem. Thanks, Ayelet --- Harry Mangalam - Research Computing, OIT, Rm 225 MSTB, UC Irvine [m/c 2225] / 92697 Google Voice Multiplexer: (949) 478-4487 415 South Circle View Dr, Irvine, CA, 92697 [shipping] MSTB Lat/Long: (33.642025,-117.844414) (paste into Google Maps) --- Something must be done. [X] is something. Therefore, we must do it. Bruce Schneier, on American response to just about anything. ___ Gluster-users mailing list Gluster-users@gluster.org http://supercolony.gluster.org/mailman/listinfo/gluster-users
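To make the diagnosis concrete: the difference is between the application dribbling one write() per 60-100 char line straight at the FUSE mount, and a single buffered (or compressed) stream arriving in large blocks. A sketch - 'generate_reads' is a stand-in for whatever is producing the fastq output:

# slow: one tiny write() per line onto the gluster mount
generate_reads > /gl/sample.fastq

# usually much faster: gzip coalesces the tiny writes into large blocks
generate_reads | gzip -1 > /gl/sample.fastq.gz

# no-compression alternative: just re-block the stream
generate_reads | dd of=/gl/sample.fastq bs=1M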
Re: [Gluster-users] Redux: Stale NFS file handle
Following up on myself.. There is a RH bug report https://bugzilla.redhat.com/show_bug.cgi?id=678544 that sounds like what I'm seeing, but it's associated with the NFS client. We're using the gluster native client, so it shouldn't be related, but it looks like there may be some NFS client code that was used in the gluster native client code as well (the cause of the 'Stale NFS handle' warning in a non-NFS setting). Could this be the case? Could the pseudo-solution for the NFS case also work for the gluster native client case? ie:

mount -o acdirmin=[23] ...

It doesn't seem like 'acdirmin' is a mount option for glusterfs, but is there a gluster equivalent?

Also, this bug https://bugzilla.redhat.com/show_bug.cgi?id=844584, which complains about the same thing as I reference below, has this notation from Amar:
---
..this issue ['Stale NFS handle'] can happen in the case where file is changed on server after it was being accessed from one node (changed as in, re-created). as long as the application doesn't see any issue, this should be ok.
---

In our case, it's causing the SGE array job to fail (or at least it appears to be highly related):

= SGE error extract:
Shepherd error: 01/03/2013 20:03:03 [507:64196]: error: can't chdir to /gl/bio/krthornt/build_div/yak/line10_CY22B/prinses: No such file or directory

= glusterfs log on client (native gluster client):
[...] remote operation failed: Stale NFS file handle. Path: /bio/krthornt/build_div/yak/line10_CY22B/prinses (590802f1-7fba-4103-ba30-e4d415b9db36)
[2013-01-03 20:03:03.168229] W [client3_1-fops.c:2630:client3_1_lookup_cbk] 0-gl-client-1: remote operation failed: Stale NFS file handle. Path: /bio/krthornt/build_div/yak/line10_CY22B/prinses (590802f1-7fba-4103-ba30-e4d415b9db36)
[2013-01-03 20:03:03.168287] W [client3_1-fops.c:2630:client3_1_lookup_cbk] 0-gl-client-2: remote operation failed: Stale NFS file handle. Path: /bio/krthornt/build_div/yak/line10_CY22B/prinses (590802f1-7fba-4103-ba30-e4d415b9db36)

(and 13 more identical lines except for the timestamp, extending out to '20:03:17.202688').

So while we might not be losing data /per se/, this failure is definitely causing our cluster to lose jobs. If this has not been reported previously, I'd be happy to file an official bug report.

hjm

On Thursday, January 03, 2013 05:56:10 PM harry mangalam wrote:
From ~ an hour's googling and reading, it looks like this (not uncommon) bug/warning/error has not necessarily been associated with data loss, but we are finding that our gluster fs is interrupting our cluster jobs with 'Stale NFS handle' warnings like this (on the client):

[2013-01-03 12:30:59.149230] W [client3_1-fops.c:2630:client3_1_lookup_cbk] 0-gl-client-0: remote operation failed: Stale NFS file handle. Path: /bio/krthornt/build_div/yak/line06_CY08A/prinses (3b0aa7b2-bf7f-4b27-b515-32e94b1206e3)

(and 7 more, differing by the timestamp of 1s). The dir mentioned existed before the job was asked to read from it, and shortly after SGE failed, I checked that the glusterfs (/bio) was still mounted and that the dir was still r/w. We are getting these errors infrequently, but fairly regularly (a couple of times a week, usually during a big array job that heavily reads from a particular dir) and I haven't seen any resolutions of the fault besides the vocabulary being corrected. I know it's not necessarily an NFS problem, but I haven't seen a fix from the gluster folks.
Our glusterfs on this system is set up like this (over QDR/tcpoib):

$ gluster volume info gl

Volume Name: gl
Type: Distribute
Volume ID: 21f480f7-fc5a-4fd8-a084-3964634a9332
Status: Started
Number of Bricks: 8
Transport-type: tcp,rdma
Bricks:
Brick1: bs2:/raid1
Brick2: bs2:/raid2
Brick3: bs3:/raid1
Brick4: bs3:/raid2
Brick5: bs4:/raid1
Brick6: bs4:/raid2
Brick7: bs1:/raid1
Brick8: bs1:/raid2
Options Reconfigured:
auth.allow: 10.2.*.*,10.1.*.*
performance.io-thread-count: 64
performance.quick-read: on
performance.io-cache: on
nfs.disable: on
performance.cache-size: 268435456
performance.flush-behind: on
performance.write-behind-window-size: 1024MB

and otherwise appears to be happy. We were having a low-level problem with the RAID servers, where this LSI/3ware error was temporally close (~2m) to the gluster error:

LSI 3DM2 alert -- host: biostor4.oit.uci.edu Jan 03, 2013 03:32:09PM - Controller 6 ERROR - Drive timeout detected: encl=1, slot=3

This error seemed to be related to construction around our data center and the dust associated with it. We have had 10s of these LSI/3ware errors with no related gluster errors or apparent problems with the RAIDs. No drives were ejected from the RAIDs and the errors did not repeat. 3ware explains: http://cholla.mmto.org/computers/3ware/3dm2/en/3DM_2_OLH-8-6.html
== 009h Drive timeout detected: The 3ware RAID controller has a sophisticated recovery mechanism to handle various types of failures
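On the acdirmin question earlier in this message: the closest glusterfs analog I know of is the FUSE attribute/entry cache timeouts, which are settable at mount time. Whether raising them papers over this particular race is untested - a sketch only, with 'server' as a placeholder for the volfile server:

mount -t glusterfs -o attribute-timeout=2,entry-timeout=2 server:/gl /bio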
[Gluster-users] Redux: Stale NFS file handle
Re: [Gluster-users] Gluster in a cluster
This doesn't seem like a good way to do what I think you want to do.

1 - /scratch should be as fast as possible, so putting it on a distributed fs, unless that fs is optimized for speed (mumble.Lustre.mumble), is a mistake.

2 - if you insist on doing this with gluster (perhaps because none of your individual /scratch partitions is large enough), making a dist-replicated /scratch is making a bad decision worse, as replication will slow the process down even more. (Why replicate what is a temp data store?)

3 - integrating the gluster server into the rocks environment (on a per-node basis) seems like a recipe for .. well, migraines, at least.

If you need a relatively fast, simple, large, reliable, aggressively caching fs for /scratch, NFS to a large RAID0/10 has some attractions, unless the gluster server fanout IO overwhelms the aforementioned attractions. IMHO...

On Thursday, November 15, 2012 09:30:52 AM Jerome wrote:
Dear all, I'm testing Gluster on a cluster of compute nodes, based on Rocks. The idea is to use the scratch space of each node as one big scratch volume, accessible on all the nodes of the cluster. For the moment, I have installed this gluster system on 4 nodes, in a distributed replica of 2, like this:

# gluster volume info

Volume Name: scratch1
Type: Distributed-Replicate
Volume ID: c8c3e3fe-c785-4438-86eb-0b84c7c29123
Status: Started
Number of Bricks: 2 x 2 = 4
Transport-type: tcp
Bricks:
Brick1: compute-2-0:/state/partition1/scratch
Brick2: compute-2-1:/state/partition1/scratch
Brick3: compute-2-6:/state/partition1/scratch
Brick4: compute-2-9:/state/partition1/scratch

# gluster volume status
Status of volume: scratch1
Gluster process                              Port   Online  Pid
----------------------------------------------------------------
Brick compute-2-0:/state/partition1/scratch  24009  Y       16464
Brick compute-2-1:/state/partition1/scratch  24009  Y       3848
Brick compute-2-6:/state/partition1/scratch  24009  Y       511
Brick compute-2-9:/state/partition1/scratch  24009  Y       2086
NFS Server on localhost                      38467  N       4060
Self-heal Daemon on localhost                N/A    N       4065
NFS Server on compute-2-0                    38467  Y       16470
Self-heal Daemon on compute-2-0              N/A    Y       16476
NFS Server on compute-2-9                    38467  Y       2092
Self-heal Daemon on compute-2-9              N/A    Y       2099
NFS Server on compute-2-6                    38467  Y       517
Self-heal Daemon on compute-2-6              N/A    Y       524
NFS Server on compute-2-1                    38467  Y       3854
Self-heal Daemon on compute-2-1              N/A    Y       3860

All of this runs correctly; I used some file stress tests to satisfy myself that the configuration was usable. My problem is when a node reboots accidentally, or for some administration task: the node reinstalls itself, and the gluster volume begins to fail. I noticed that the UUID of a machine is generated during the installation, so I developed a script to restore the original UUID of the node. Despite this, the node could not get back into the volume; I am missing some needed step. So, is it possible to do such a system with gluster? Or maybe I have to reconfigure the whole volume when a node reinstalls?

Best regards.

-- Harry Mangalam - Research Computing, OIT, Rm 225 MSTB, UC Irvine [m/c 2225] / 92697 Google Voice Multiplexer: (949) 478-4487 415 South Circle View Dr, Irvine, CA, 92697 [shipping] MSTB Lat/Long: (33.642025,-117.844414) (paste into Google Maps) -- Passive-Aggressive Supporter of the The Canada Party: http://www.americabutbetter.com/

___ Gluster-users mailing list Gluster-users@gluster.org http://supercolony.gluster.org/mailman/listinfo/gluster-users
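For the reinstall problem specifically, restoring the UUID is necessary but not sufficient; the rebuilt node also needs the volume definitions pushed back to it. A hedged sketch of the sequence (paths per stock packaging, node names from the example above):

# before reinstall (or from backup): preserve the node's identity
cp /var/lib/glusterd/glusterd.info /somewhere/safe/

# after reinstall, with glusterd stopped:
cp /somewhere/safe/glusterd.info /var/lib/glusterd/
service glusterd start

# from a surviving peer, re-probe the rebuilt node:
gluster peer probe compute-2-0

# on the rebuilt node, pull the volume definitions from a healthy peer:
gluster volume sync compute-2-1 all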
Re: [Gluster-users] Very slow directory listing and high CPU usage on replicated volume
Jeff Darcy wrote a nice piece in his hekafs blog about 'the importance of keeping things sequential', which is essentially about the contention for heads between data IO and journal IO: http://hekafs.org/index.php/2012/11/the-importance-of-staying-sequential/ (also congrats on the Linux Journal article on the glupy python/gluster approach).

We've been experimenting with SSDs on ZFS (using the SSDs for the ZIL (journal)) and while it's provided a little bit of a boost, it has not been dramatic. Ditto XFS. However, we did not stress it at all with heavy loads in a gluster env, and I'm now thinking that that is where you would see the improvement. (See Jeff's graph about how the diff in threads/load affects IOPS.)

Is anyone running a gluster system with the underlying XFS writing the journal to SSDs? If so, any improvement? I would have expected to hear about this as a recommended architecture for gluster if it had performed MUCH better, but ...? We're about to combine 2 clusters and may just go ahead with this approach as a /scratch system to test it; see the sketch after this message.

hjm

On Monday, November 05, 2012 07:58:22 AM Jonathan Lefman wrote:
I take it back. Things deteriorated pretty quickly after I began dumping data onto my volume from multiple clients. Initially my transfer rates were okay - not fast, but livable. However, after about an hour of copying several terabytes from 3-4 client machines, the rates of transfer often dropped to ~1 b/s. Sometimes I would see a couple-second burst of good transfer rates. Anyone have ideas on how to address this effectively? I'm at a loss. -Jon

On Nov 2, 2012 1:21 PM, Jonathan Lefman jonathan.lef...@essess.com wrote:
I should have also said that my volume is working well now and all is well. -Jon

On Fri, Nov 2, 2012 at 1:21 PM, Jonathan Lefman jonathan.lef...@essess.com wrote:
Thank you Brian. I'm happy to hear that this behavior is not typical. I am now using xfs on all of my drives. I also wiped out the entire /etc/glusterd directory for good measure. I bet that there was residual information from a previous attempt at a gluster volume that must have caused problems. Or moving to xfs from ext4 is an amazing fix, but I think this is less likely. I appreciate your time responding to me. -Jon

On Nov 2, 2012 4:44 AM, Brian Candler b.cand...@pobox.com wrote:
On Thu, Nov 01, 2012 at 08:03:21PM -0400, Jonathan Lefman wrote:
Soon after loading up about 100 MB of small files (about 300kb each), the drive usage is at 1.1T.

That is very odd. What do you get if you run du and df on the individual bricks themselves? 100MB is only ~330 files of 300KB each. Did you specify any special options to mkfs.ext4? Maybe -I 512 would help, as the xattrs are more likely to sit within the inodes themselves. If you start everything from scratch, it would be interesting to see df stats when the filesystem is empty. It may be that a huge amount of space has been allocated to inodes. If you expect most of your files to be >16KB, then you could add -i 16384 to mkfs.ext4 to reduce the space reserved for inodes. But using xfs would be better, as it doesn't reserve any space for inodes; it allocates them dynamically. Ignore the comment that glusterfs is not designed for handling large counts of small files - 300KB is not small.

Regards, Brian.
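Returning to the XFS-journal-on-SSD question at the top of this message: for anyone wanting to run the experiment, the relevant knobs are mkfs- and mount-time options; the device names below are examples only:

# put the XFS log on a small, fast SSD partition
mkfs.xfs -l logdev=/dev/ssd_part,size=512m /dev/md0
mount -o logdev=/dev/ssd_part,logbsize=256k /dev/md0 /brick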
Re: [Gluster-users] volume started but not 'startable', not 'stoppable'
Sorry for not responding immediately - I was drowning in flopsweat trying to get it back up. After some false starts, mostly due to premature mounting of the inconsistent volfile, remounting after the glusterfs came back and re-established a valid volfile seems to have resolved everything. Thanks very much for the help..

hjm

On Monday, October 08, 2012 02:30:07 PM John Mark Walker wrote:
I suspect this didn't go through - forwarding.

- Original Message -
Harry, could you paste/attach the contents of the /var/lib/glusterd/gli/info files and the glusterd log files from the 4 peers in the cluster? From the volume-info snippet you had pasted, it appears that the node which was shut down differs in its view of the volume's status.
thanks, krish

- Original Message -
From: harry mangalam johnm...@johnmark.org
To: gluster-users@gluster.org
Sent: Monday, October 8, 2012 2:49:05 AM
Subject: Re: [Gluster-users] volume started but not 'startable', not 'stoppable'

And a few more data points: it appears the reason for the flaky gluster fs is that not all the servers are running glusterfsd's (see below). Is there a way to force the servers to all start the glusterfsd's, as they're supposed to?

The mystery rebalance did complete, and seems to have fixed some but not all problem files - ie:

drwx------ 2 spoorkas spoorkas 8211 Jun 2 00:22 QPSK_2Tx_2Rx_BH_Method2/
?--------- ? ? ? ? ? QPSK_2Tx_2Rx_ML_Method1

And the started/not-started status has gotten weirder, if possible. The gluster volume is still being exported to clients, despite gluster insisting that the volume is not started (servers are pbs[1234]).

result of $ gluster volume status:
pbs1: Volume gli is not started
pbs2: Volume gli is not started
pbs3: Volume gli is not started
pbs4: Volume gli is not started

$ gluster volume info:
pbs1: Status: Stopped
pbs2: Status: Started <- aha!
pbs3: Status: Started <- aha!
pbs4: Status: Started

This correlates with the glusterfsd status, in which only pbs[2,3] are running glusterfsd:

pbs2: root 1799 0.1 0.0 184296 16464 ? Ssl 13:07 0:06 /usr/sbin/glusterfsd -s localhost --volfile-id gli.pbs2ib.bducgl -p /var/lib/glusterd/vols/gli/run/pbs2ib-bducgl.pid -S /tmp/c70b2f910e2fe1bb485b1d76ef63e3db.socket --brick-name /bducgl -l /var/log/glusterfs/bricks/bducgl.log --xlator-option *-posix.glusterd-uuid=26de63bd-c5b7-48ba-b81d-5d77a533d077 --brick-port 24025 24026 --xlator-option gli-server.transport.rdma.listen-port=24026 --xlator-option gli-server.listen-port=24025

pbs3: root 1751 0.1 0.0 184168 16468 ? Ssl 13:07 0:06 /usr/sbin/glusterfsd -s localhost --volfile-id gli.pbs3ib.bducgl -p /var/lib/glusterd/vols/gli/run/pbs3ib-bducgl.pid -S /tmp/7096377992feb7f5a7805cafd82c3100.socket --brick-name /bducgl -l /var/log/glusterfs/bricks/bducgl.log --xlator-option *-posix.glusterd-uuid=c79c4084-d6b9-4af9-b975-40dd6aa99b42 --brick-port 24018 24020 --xlator-option gli-server.transport.rdma.listen-port=24020 --xlator-option gli-server.listen-port=24018

pbs[1,4] are only running the glusterd process, not any glusterfsd's. In previous startups, pbs4 WAS running a glusterfsd, but pbs1 has not run one since the powerdown, AFAIK.
hjm On Saturday, October 06, 2012 10:19:14 PM harry mangalam wrote: ...and should have added: the rebalance log (the volume claimed to be rebalancing before I shut it down but was idle or wedged at that time) is active as well with about 1 warning of a 1 subvolumes down -- not fixing for every 3 informational messages: 2012-10-06 22:05:35.396650] I [dht-rebalance.c:1058:gf_defrag_migrate_data] 0-gli-dht: migrate data called on /nlduong/nduong2-t- illiac/workspace/m5_sim/trunk/src/arch/.svn/tmp/wcprops [2012-10-06 22:05:35.451925] I [dht-layout.c:593:dht_layout_normalize] 0-gli- dht: found anomalies in /nlduong/nduong2-t- illiac/workspace/m5_sim/trunk/src/arch/.svn/wcprops. holes=1 overlaps=0 [2012-10-06 22:05:35.451957] W [dht-selfheal.c:875:dht_selfheal_directory] 0- gli-dht: 1 subvolumes down -- not fixing previously... gluster 3.3, running on ubuntu 10.04, was running OK, had to shut down for a power outage. When I tried to shut it down, it insisted that it was rebalancing, but seeemed wedged - no activity in the logs. Was able to shut it down tho. After power was restored, tried to restart the volume but altho the 4 peers claimed to be visible and could ping each other etc: == Sat Oct 06 21
Re: [Gluster-users] volume started but not 'startable', not 'stoppable'
Hi Amar. Thanks SO much. That did it. (well, there are other remaining problems but they seem to be easily addressed relative to that initial problem). My Mum told me never to force anything, but you've proved her wrong. :) for others following in this thread - a 'force stop' and a 'force start' made everything come back. I'm re-running a rebalance to address some of the file inconsistencies, but almost all of them were resolved in the /forced/ restarting of the volume. hjm On Monday, October 08, 2012 03:46:54 PM Amar Tumballi wrote: And since they think it's not started, I can't stop it. How is this resolvable? can you try 'gluster volume stop VOLNAME force' ? (or 'gluster volume start VOLNAME force' Regards, Amar ___ Gluster-users mailing list Gluster-users@gluster.org http://supercolony.gluster.org/mailman/listinfo/gluster-users -- Harry Mangalam - Research Computing, OIT, Rm 225 MSTB, UC Irvine [m/c 2225] / 92697 Google Voice Multiplexer: (949) 478-4487 415 South Circle View Dr, Irvine, CA, 92697 [shipping] MSTB Lat/Long: (33.642025,-117.844414) (paste into Google Maps) -- Passive-Aggressive Supporter of the The Canada Party: http://www.americabutbetter.com/ ___ Gluster-users mailing list Gluster-users@gluster.org http://supercolony.gluster.org/mailman/listinfo/gluster-users
[Gluster-users] volume started but not 'startable', not 'stoppable'
root 18 Sep 10 13:41 bonnie2/
drwx------ 2 spoorkas spoorkas  8211 Jun  2 00:22 QPSK_2Tx_2Rx_BH_Method2/
?--------- ? ?        ?             ?            QPSK_2Tx_2Rx_ML_Method1
drwx------ 2 spoorkas spoorkas  8237 Jun  3 11:22 QPSK_2Tx_2Rx_ML_Method2/
drwx------ 2 spoorkas spoorkas 12288 Jun  4 01:24 QPSK_2Tx_3Rx_BH/
drwx------ 2 spoorkas spoorkas  4232 Jun  2 00:26 QPSK_2Tx_3Rx_BH_Method1/
drwx------ 2 spoorkas spoorkas  8274 Jun  2 00:34 QPSK_2Tx_3Rx_BH_Method2/
?--------- ? ?        ?             ?            QPSK_2Tx_3Rx_ML_Method1
?--------- ? ?        ?             ?            QPSK_2Tx_3Rx_ML_Method2
-rw-r--r-- 1 spoorkas spoorkas     0 Apr 17 14:16 simple.sh.e1802207

(These files appear to be intact on the individual bricks, tho.)

== Sat Oct 06 21:38:18 [0.76 0.71 0.58] root@pbs2:/var/log/glusterfs/bricks
568 $ gluster volume status
Volume gli is not started
==

and since that is the case, other utilities also claim this:

== Sat Oct 06 21:41:25 [1.04 0.84 0.65] root@pbs2:/var/log/glusterfs/bricks
571 $ gluster volume status gli detail
Volume gli is not started
==

And since they think it's not started, I can't stop it. How is this resolvable?

-- Harry Mangalam - Research Computing, OIT, Rm 225 MSTB, UC Irvine [m/c 2225] / 92697 Google Voice Multiplexer: (949) 478-4487 415 South Circle View Dr, Irvine, CA, 92697 [shipping] MSTB Lat/Long: (33.642025,-117.844414) (paste into Google Maps) -- Passive-Aggressive Supporter of the The Canada Party: http://www.americabutbetter.com/

___ Gluster-users mailing list Gluster-users@gluster.org http://supercolony.gluster.org/mailman/listinfo/gluster-users
[Gluster-users] Retraction: Protocol stacking: gluster over NFS
Hi All, Well, it http://goo.gl/hzxyw was too good to be true. Under extreme, extended IO on a 48core node, some part of the NFS stack collapses and leads to an IO lockup thru NFS. We've replicated it on 48core and 64core nodes, but don't know yet whether it acts similarly on lower-core-count nodes. Tho I haven't had time to figure out exactly /how/ it collapses, I owe it to those who might be thinking of using it to tell them not to. This is what I wrote, describing the situation to some co-workers:

===
With Joseph's and Kevin's help, I've been able to replicate Kevin's complete workflow on BDUC and executed it with a normally mounted gluster fs and my gluster-via-NFS-loopback (on both NFS3 and NFS4 clients). The good news is that the workflow went to completion on BDUC with the native gluster fs mount, doing pretty decent IO on one node - topping out at about 250MB/s in and 75MB/s out (DDR IB):

ib1   KB/s in    KB/s out
      268248.1   62278.40
      262835.1   64813.55
      248466.0   61000.24
      250071.3   67770.03
      252924.1   67235.13
      196261.3   56165.20
      255562.3   68524.45
      237479.3   68813.99
      209901.8   73147.73
      217020.4   70855.45

The bad news is that I've been able to replicate the failures that JF has seen. The workflow starts normally but then eats up free RAM as KT's workflow saturates the nodes with about 26 instances of samtools, which does a LOT of IO (10s of GB in the ~30m of the run). This was the case even when I increased the number of nfsd's to 16 and even 32. When using native gluster, the workflow goes to completion in about 23 hrs - about the same as when KT executed it on his machine (using NFS, I think..?). However, when using the loopback mount, on both NFS3 and NFS4, it locks up the NFS side (the gluster mount continues to be R/W), requiring a hard reset on the node to clear the NFS error. It is interesting that the samtools processes lock up during /reads/, not writes (via stracing several of the processes).

I found this entry in a FraunhoferFS discussion, from https://groups.google.com/forum/?fromgroups=#!topic/fhgfs-user/XoGPbv3kfhc

[[ In general, any network file system that uses the standard kernel page cache on the client side (including e.g. NFS, just to give another example) is not suitable for running client and server on the same machine, because that would lead to memory allocation deadlocks under high memory pressure - so you might want to watch out for that. (fhgfs uses a different caching mechanism on the clients to allow running it in such scenarios.) ]]

but why this would be the case, I'm not sure - the server and client processes should be unable to step on each other's data structures, so why they would interfere with each other is unclear. Others on this list have mentioned similar opinions - I'd be interested in why this is theoretically the case.

The upshot is that under extreme, extended IO, NFS will lock up, so while we haven't seen it on BDUC except for KT's workflow, it's repeatable and we can't recover from it smoothly. So we should move away from it. I haven't been able to test it on a 3.x kernel (but will after this weekend); it's possible that it might work better, but I'm not optimistic.
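The deadlock mechanism, as I understand it (hedged - this is the standard explanation, not something traced here): it isn't the two processes stepping on each other's data structures, but a circular dependency in memory reclaim - the NFS client fills RAM with dirty pages, the kernel must write them out via NFS to free memory, and the NFS server on the same machine needs free memory to service those writes. The commonly suggested (partial) mitigation is to force earlier, smaller flushes:

sysctl -w vm.dirty_ratio=10
sysctl -w vm.dirty_background_ratio=5
sysctl -w vm.min_free_kbytes=262144   # keep some headroom for the server side

This narrows the window but does not remove the circularity.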
Re: [Gluster-users] NFS over gluster stops responding under write load
heading towards OOM territory. The glusterfs daemon is currently consuming 90% of MEM according to top. thanks ___ Gluster-users mailing list Gluster-users@gluster.org http://gluster.org/cgi-bin/mailman/listinfo/gluster-users -- Harry Mangalam - Research Computing, OIT, Rm 225 MSTB, UC Irvine [m/c 2225] / 92697 Google Voice Multiplexer: (949) 478-4487 415 South Circle View Dr, Irvine, CA, 92697 [shipping] MSTB Lat/Long: (33.642025,-117.844414) (paste into Google Maps) -- What does it say about a society that would rather send its children to kill and die for oil than to get on a bike? ___ Gluster-users mailing list Gluster-users@gluster.org http://gluster.org/cgi-bin/mailman/listinfo/gluster-users
Re: [Gluster-users] cannot create a new volume with a brick that used to be part of a deleted volume?
I believe gluster writes 2 entries into the top level of your gluster brick filesystems: -rw-r--r-- 2 root root36 2012-06-22 15:58 .gl.mount.check drw--- 258 root root 8192 2012-04-16 13:20 .glusterfs You will have to remove these as well as all the other fs info from the volume to re-add the fs as another brick. Or just remake the filesystem - instantaneous with XFS, less so with ext4. hjm On Tuesday, September 18, 2012 11:03:35 AM Lonni J Friedman wrote: Greetings, I'm running v3.3.0 on Fedora16-x86_64. I used to have a replicated volume on two bricks. This morning I deleted it successfully: [root@farm-ljf0 ~]# gluster volume stop gv0 Stopping volume will make its data inaccessible. Do you want to continue? (y/n) y Stopping volume gv0 has been successful [root@farm-ljf0 ~]# gluster volume delete gv0 Deleting volume will erase all information about the volume. Do you want to continue? (y/n) y Deleting volume gv0 has been successful [root@farm-ljf0 ~]# gluster volume info all No volumes present I then attempted to create a new volume using the same bricks that used to be part of the (now) deleted volume, but it keeps refusing failing claiming that the brick is already part of a volume: [root@farm-ljf1 ~]# gluster volume create gv0 rep 2 transport tcp 10.31.99.165:/mnt/sdb1 10.31.99.166:/mnt/sdb1 /mnt/sdb1 or a prefix of it is already part of a volume [root@farm-ljf1 ~]# gluster volume info all No volumes present Note farm-ljf0 is 10.31.99.165 and farm-ljf1 is 10.31.99.166. I also tried restarting glusterd (and glusterfsd) hoping that might clear things up, but it had no impact. How can /mnt/sdb1 be part of a volume when there are no volumes present? Is this a bug, or am I just missing something obvious? thanks ___ Gluster-users mailing list Gluster-users@gluster.org http://gluster.org/cgi-bin/mailman/listinfo/gluster-users -- Harry Mangalam - Research Computing, OIT, Rm 225 MSTB, UC Irvine [m/c 2225] / 92697 Google Voice Multiplexer: (949) 478-4487 415 South Circle View Dr, Irvine, CA, 92697 [shipping] MSTB Lat/Long: (33.642025,-117.844414) (paste into Google Maps) -- What does it say about a society that would rather send its children to kill and die for oil than to get on a bike? ___ Gluster-users mailing list Gluster-users@gluster.org http://gluster.org/cgi-bin/mailman/listinfo/gluster-users
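In 3.3 the 'already part of a volume' check also looks at extended attributes on the brick root, so removing those two entries may not be enough by itself. A sketch of the full cleanup - this destroys gluster metadata, so only do it on a brick you mean to recycle:

BRICK=/mnt/sdb1
setfattr -x trusted.glusterfs.volume-id $BRICK
setfattr -x trusted.gfid $BRICK
rm -rf $BRICK/.glusterfs

After that, the volume create should accept the brick again.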
Re: [Gluster-users] Protocol stacking: gluster over NFS
Just to clarify, we are using the native kNFS-server from the distro, not gluster's NFS implementation. Note that CentOS 5.7/5.8 does not seem to support this kind of loopback mounting with the kNFS version they use (kNFS ver 1.0.8/9). However, the recent kNFS servers from Ubuntu/Debian (kNFSv 1.2.0) and SL 6.2 (1.2.3) do support it. We're still testing but have not yet found the kind of deadlocks/crashes that others have mentioned with the gluster NFS (touch wood). hjm On Monday, September 17, 2012 09:08:08 AM Jeff White wrote: I was under the impression that self-mounting NFS of any kind (mount -t nfs localhost...) was a dangerous thing. When I did that with gNFS I could cause a server to crash in no time at all with a simple dd into the mount point. I was under the impression that kNFS would have the same problem though I have not tested in myself (this was discussed in #gluster on irc.freenode.net some time ago). I'm guessing this would be a bug in the kernel. Has anyone seen issues or crashes with locally mounted NFS (either gNFS or kNFS)? Jeff White - GNU+Linux Systems Administrator University of Pittsburgh - CSSD On 09/14/2012 03:22 PM, John Mark Walker wrote: A note on recent history: There were past attempts to export GlusterFS client mounts over NFS, but those used the GlusterFS NFS service. I believe this is the first instance in the wild of someone trying this with knfsd. With the former, while there was increased performance, there would invariably be race conditions that would lock up GlusterFS. See the ominous warnings posted on this QA thread: http://community.gluster.org/a/nfs-performance-with-fuse-client-redundanc y/ I am curious to see if using knfsd, as opposed to GlusterFS' NFS service, yields a long-term solution for this type of workload. Please do continue to keep us updated. Thanks, JM -- Harry Mangalam - Research Computing, OIT, Rm 225 MSTB, UC Irvine [m/c 2225] / 92697 Google Voice Multiplexer: (949) 478-4487 415 South Circle View Dr, Irvine, CA, 92697 [shipping] MSTB Lat/Long: (33.642025,-117.844414) (paste into Google Maps) -- What does it say about a society that would rather send its children to kill and die for oil than to get on a bike? ___ Gluster-users mailing list Gluster-users@gluster.org http://gluster.org/cgi-bin/mailman/listinfo/gluster-users
Re: [Gluster-users] Protocol stacking: gluster over NFS
Hi Venky - thanks for the link to this translator. I'll take a look at it, but right now we don't have too much trouble with reads - it's the 'zillions of tiny writes' problem that's hosing us, and the NFS solution gives us a bit more headroom. We'll be moving this out to part of our cluster today (unless someone can convince me otherwise) and we'll see if it shows real-world improvements.

hjm

On Friday, September 14, 2012 11:33:34 AM Venky Shankar wrote:
Hi Harry, There is a compression translator in Gerrit which you might be interested in: http://review.gluster.org/#change,3251 It compresses data (using the zlib library) before it is sent out to the network from the server, and on the other side (client; FUSE mount) it decompresses it. Also, note that it only does this for data transferred as part of the read fop, and the volfiles need to be hand-edited (CLI support is still pending). I've not done any performance runs till now but I plan to do so soon. -Venky

On Friday 14 September 2012 10:25 AM, harry mangalam wrote:
Hi All, We have been experimenting with 'protocol stacking' - that is, running gluster over NFS. What I mean:
- mounting a gluster fs via the native client,
- then NFS-exporting the gluster fs to the client itself
- then mounting that gluster fs via NFS3 to take advantage of the client-side caching.

We've tried it on a limited basis (single client) and not only does it work, but it works surprisingly well, gaining about 2-3X the write performance relative to the native gluster mount on uncompressed data, using small writes. Using compressed data (piping thru gzip, for example) is more variable - if the data is highly compressible, it tends to increase performance; if less compressible, it tends to decrease performance. As I noted previously http://goo.gl/7G7k3, piping small writes thru gzip /can/ tremendously increase performance on a gluster fs in some bioinformatics applications.

A graph of the performance on various file sizes (created by a trivial program that does zillions of tiny writes - a sore point in the gluster performance spectrum) is shown here: http://moo.nac.uci.edu/~hjm/fs_perf_gl-tmp-glnfs.png

The graphs show the time to complete and sync on a set of writes from 10MB to 30GB on 3 fs's:
- /tmp on the client's system disk (a single 10K USCSI)
- /gl, a 4 server, 4 brick gluster (distributed-only) fs
- /gl/nfs, the same gluster fs, loopback-mounted via NFS3 on the client

The results show that using a gluster fs loopback-mounted to itself increased performance by 2-3X, increasing as the file size increased to 30GB. The client (64GB RAM) was otherwise idle when I did these tests. In addition (data not shown), I also tested how compression (piping the output thru gzip) affected the total time-to-complete. In one case, due to the identical string being written, gzip managed about 1000X compression, so the eventual file size sent to the disk was almost inconsequential. Nevertheless, the extra time for the compression was more than made up for by the reduced data, and adding gzip decreased the time-to-complete significantly. In other testing with less compressible data (shown above), the compression time overwhelmed the write time and all the fs had essentially identical times per file size. In all cases, the writes were followed by a 'sync' to flush the cache.
It seems that the loopback NFS mounting of the gluster fs is a fairly obvious win (overall, about 2-3x times the write speed) in terms of taking avantage of gluster's fs scaling and namespace with NFS3's client-side caching, but I'd like to hear from other gluster users as to possible downsides of this approach. hjm ___ Gluster-users mailing list Gluster-users@gluster.org http://gluster.org/cgi-bin/mailman/listinfo/gluster-users -- Harry Mangalam - Research Computing, OIT, Rm 225 MSTB, UC Irvine [m/c 2225] / 92697 Google Voice Multiplexer: (949) 478-4487 415 South Circle View Dr, Irvine, CA, 92697 [shipping] MSTB Lat/Long: (33.642025,-117.844414) (paste into Google Maps) -- What does it say about a society that would rather send its children to kill and die for oil than to get on a bike? ___ Gluster-users mailing list Gluster-users@gluster.org http://gluster.org/cgi-bin/mailman/listinfo/gluster-users
Re: [Gluster-users] Protocol stacking: gluster over NFS
Well, it was too clever for me too :) - someone else suggested it when I was describing some of the options we were facing. I admit to initially thinking that it was silly to expect better performance by stacking protocols, but we tried it and it seems to have worked.

To your point:
- the 'client' is the end node that uses the gluster storage - in our case it's a compute node (w/ limited storage) in a research cluster.
- the 'server' is the collection of nodes that provides the gluster storage.
- the client mounts the server with the native gluster client, providing all the gluster advantages of single namespace, scalability, reliability, etc. to 'client:/glmount'.
- the client then exports that gluster fs via NFS to itself, so 'client:/glmount' is listed in '/etc/exports' as rw to itself.
- the client then mounts itself (innuendo and disturbing mental images notwithstanding) via NFS: 'mount -t nfs localhost:/glmount /glnfs', so that the gluster fs (/glmount) is NFS-loopback-mounted on the client (itself).

From our test case, simplified (all non-gluster-related entries deleted):

hmangala@claw5:~
$ cat /etc/mtab
...
pbs1ib:/gli /glmount fuse.glusterfs rw,default_permissions,allow_other,max_read=131072 0 0
...
claw5:/glmount /glnfs nfs rw,addr=10.255.78.40 0 0
...

In the above extract, pbs1ib:/gli is the gluster fs that is mounted on 'claw5:/glmount'. claw5 then NFS-mounts claw5:/glmount onto /glnfs, which users actually use to read/write. I agree, not very intuitive... but it seems to work. This is with NFS3 clients. NFS4 may provide an additional perf boost by allowing clients to work out of cache until forced to sync, but we haven't tried that yet, and the test methodology we used wouldn't show a gain anyway. I'll have to try to create a more realistic test harness.

hjm

On Friday, September 14, 2012 01:04:59 PM Whit Blauvelt wrote:
On Fri, Sep 14, 2012 at 09:41:42AM -0700, harry mangalam wrote:
What I mean:
- mounting a gluster fs via the native client,
- then NFS-exporting the gluster fs to the client itself
- then mounting that gluster fs via NFS3 to take advantage of the client-side caching.

Harry, What is the client itself here? I'm having trouble picturing what's doing what with what. No doubt because it's too clever for me. Maybe a bit more description would clarify it nonetheless.

Thanks, Whit

-- Harry Mangalam - Research Computing, OIT, Rm 225 MSTB, UC Irvine [m/c 2225] / 92697 Google Voice Multiplexer: (949) 478-4487 415 South Circle View Dr, Irvine, CA, 92697 [shipping] MSTB Lat/Long: (33.642025,-117.844414) (paste into Google Maps) -- What does it say about a society that would rather send its children to kill and die for oil than to get on a bike?

___ Gluster-users mailing list Gluster-users@gluster.org http://gluster.org/cgi-bin/mailman/listinfo/gluster-users
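For anyone wanting to replicate the loopback setup: the kNFS export of a FUSE filesystem needs an explicit fsid=, since FUSE mounts lack the stable device number the server would otherwise use. A sketch, using the paths from the example above:

# /etc/exports on the client node itself:
/glmount localhost(rw,no_root_squash,fsid=1)

# then:
exportfs -ra
mount -t nfs -o vers=3 localhost:/glmount /glnfs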
[Gluster-users] Protocol stacking: gluster over NFS
Hi All, We have been experimenting with 'protocol stacking' - that is, running gluster over NFS. What I mean:
- mounting a gluster fs via the native client,
- then NFS-exporting the gluster fs to the client itself
- then mounting that gluster fs via NFS3 to take advantage of the client-side caching.

We've tried it on a limited basis (single client) and not only does it work, but it works surprisingly well, gaining about 2-3X the write performance relative to the native gluster mount on uncompressed data, using small writes. Using compressed data (piping thru gzip, for example) is more variable - if the data is highly compressible, it tends to increase performance; if less compressible, it tends to decrease performance. As I noted previously http://goo.gl/7G7k3, piping small writes thru gzip /can/ tremendously increase performance on a gluster fs in some bioinformatics applications.

A graph of the performance on various file sizes (created by a trivial program that does zillions of tiny writes - a sore point in the gluster performance spectrum) is shown here: http://moo.nac.uci.edu/~hjm/fs_perf_gl-tmp-glnfs.png

The graphs show the time to complete and sync on a set of writes from 10MB to 30GB on 3 fs's:
- /tmp on the client's system disk (a single 10K USCSI)
- /gl, a 4 server, 4 brick gluster (distributed-only) fs
- /gl/nfs, the same gluster fs, loopback-mounted via NFS3 on the client

The results show that using a gluster fs loopback-mounted to itself increased performance by 2-3X, increasing as the file size increased to 30GB. The client (64GB RAM) was otherwise idle when I did these tests. In addition (data not shown), I also tested how compression (piping the output thru gzip) affected the total time-to-complete. In one case, due to the identical string being written, gzip managed about 1000X compression, so the eventual file size sent to the disk was almost inconsequential. Nevertheless, the extra time for the compression was more than made up for by the reduced data, and adding gzip decreased the time-to-complete significantly. In other testing with less compressible data (shown above), the compression time overwhelmed the write time and all the fs had essentially identical times per file size. In all cases, the writes were followed by a 'sync' to flush the cache.

It seems that the loopback NFS mounting of the gluster fs is a fairly obvious win (overall, about 2-3X the write speed) in terms of taking advantage of gluster's fs scaling and namespace with NFS3's client-side caching, but I'd like to hear from other gluster users as to possible downsides of this approach.

hjm

-- Harry Mangalam - Research Computing, OIT, Rm 225 MSTB, UC Irvine [m/c 2225] / 92697 Google Voice Multiplexer: (949) 478-4487 415 South Circle View Dr, Irvine, CA, 92697 [shipping] MSTB Lat/Long: (33.642025,-117.844414) (paste into Google Maps) --

___ Gluster-users mailing list Gluster-users@gluster.org http://gluster.org/cgi-bin/mailman/listinfo/gluster-users
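A minimal reconstruction of that kind of timing test - 'tinywrite' is a hypothetical stand-in for the trivial line-by-line writer, not the actual script used:

for target in /tmp /gl /gl/nfs; do
  for size in 10M 100M 1G 10G 30G; do
    /usr/bin/time -f "$target $size: %e s" \
      sh -c "tinywrite $size > $target/test.$size && sync"
  done
done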
Re: [Gluster-users] XFS and MD RAID
We're using 3ware Inc 9750 SAS2/SATA-II RAID controllers in a 4-brick, 400TB gluster system. The 4 have performed very well overall in about 6mo of production work, alerting us to problem disks, etc. Tho 3ware is an LSI product now, this model retains the familiar, if somewhat grunty, 3dm2 interface and a usable cli, as opposed to the sulfuric acid enema interface of many native LSI controllers.

We have also used mdadm with the older 8-port Marvell PCI-X software raid controller in another, older gluster system, which has worked flawlessly. lspci says: MV88SX6081 8-port SATA II PCI-X. This one is also ZFS-compatible. As others have said, using mdadm is much easier due to its unified interface on top of heterogeneous hardware, and if it's any slower, I haven't felt it. Using the hardware RAID was sort of forced on me due to fear of the commandline from others using the system. :(

hjm

On Monday, September 10, 2012 09:39:18 AM Brian Candler wrote:
On Mon, Sep 10, 2012 at 09:29:25AM +0800, Jack Wang wrote:
below patch should fix your bug.

Thank you Jack - that was a very quick response! I'm building a new kernel with this patch now and will report back. However, I think the existence of this bug suggests that Linux with software RAID is unsuitable for production use. There has obviously been no testing of basic critical functionality like hot-plugging drives, and serious regressions are introduced into supposedly stable kernels. So I'm now on the lookout for a 24-port SATA RAID controller with good Linux support. What are my options? Googling, I have found:

* 3ware 9650SE-24
* Areca ARC-1280ML
* LSI MegaRAID 9280-24i (newer SAS/SATA)
* Areca ARC-1882ix-24 (newer SAS/SATA)

However, I see some people suggesting just a RAID card with a few ports plus a SAS expander backplane. This would be fine too - I don't mind an aggregate throughput limit of 6Gb/s for some or all of the drives. I just want to be sure that the RAID controller will handle all the possible failure modes and swap events of the various drives.

Regards, Brian.

___ Gluster-users mailing list Gluster-users@gluster.org http://gluster.org/cgi-bin/mailman/listinfo/gluster-users

-- Harry Mangalam - Research Computing, OIT, Rm 225 MSTB, UC Irvine [m/c 2225] / 92697 Google Voice Multiplexer: (949) 478-4487 415 South Circle View Dr, Irvine, CA, 92697 [shipping] MSTB Lat/Long: (33.642025,-117.844414) (paste into Google Maps) -- What does it say about a society that would rather send its children to kill and die for oil than to get on a bike?

___ Gluster-users mailing list Gluster-users@gluster.org http://gluster.org/cgi-bin/mailman/listinfo/gluster-users
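For reference, the day-to-day health checks differ mainly in syntax between the two setups; a sketch, with controller/array IDs as examples:

# 3ware/LSI 9750, via the cli mentioned above:
tw_cli /c0 show          # controller 0: unit and drive status
tw_cli /c0/u0 show       # one unit in detail

# mdadm software RAID:
cat /proc/mdstat
mdadm --detail /dev/md0
mdadm --monitor --scan --daemonise --mail=root   # mail on degraded arrays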
[Gluster-users] Non-progressing, Unstoppable rebalance on 3.3
Following an interchange with Jeff Darcy and Shishir Gowda, I started a rebalance of my cluster (3.3 on Ubuntu 10.04.4). Note: shortly after it started, 3/4 of the glusterfsd's shut down (which was exciting..). I stopped and restarted glusterd and the glusterfsd's restarted in turn and all was well; however, it may have caused a problem with the rebalance: after 2 days of waiting, the rebalance has apparently done nothing (distracted by other things) and presents with the same values as it had originally:

Thu Aug 23 10:35:11 [0.00 0.00 0.00] root@pbs1:/var/log/glusterfs
770 $ gluster volume rebalance gli status
Node       Rebalanced-files  size   scanned  failures  status
---------  ----------------  -----  -------  --------  -----------
localhost  0                 0      0        0         in progress
pbs4ib     0                 0      0        0         not started
pbs2ib     1380547324969 76863                         completed
pbs3ib     0                 0      0        0         not started

(The above has the leading 32 blanks trimmed from the output - is there a reason for including those in the output?)

The above implies that it is at least partially in progress, but after stopping it:

Thu Aug 23 10:53:26 [0.00 0.00 0.00] root@pbs1:/var/log/glusterfs
774 $ gluster volume rebalance gli stop
Node       Rebalanced-files  size   scanned  failures  status
---------  ----------------  -----  -------  --------  -----------
localhost  0                 0      0        0         in progress
pbs4ib     0                 0      0        0         not started
pbs2ib     1380547324969 76863                         completed
pbs3ib     0                 0      0        0         not started
Stopped rebalance process on volume gli

it still seems to be going:

Thu Aug 23 10:53:28 [0.00 0.00 0.00] root@pbs1:/var/log/glusterfs
775 $ gluster volume rebalance gli status
Node       Rebalanced-files  size   scanned  failures  status
---------  ----------------  -----  -------  --------  -----------
localhost  0                 0      0        0         in progress
pbs4ib     0                 0      0        0         not started
pbs2ib     1380547324969 76863                         completed
pbs3ib     0                 0      0        0         not started

Examining the server nodes, only pbs1 (localhost in the above output) had glusterfs running, which may have been 'orphaned' when I had the glusterfsd hiccups and has been hanging since that time. However, when I killed it, nothing changed. gluster still reports that the rebalance is in progress (even tho no glusterfs's are running on any of the nodes). If I try to reset it with a 'start force':

Thu Aug 23 11:14:39 [0.06 0.04 0.00] root@pbs1:/var/log/glusterfs
789 $ gluster volume rebalance gli start force
Rebalance on gli is already started

and the status remains exactly as above. From the clients' POV, all seems to be fine, but I've got a hanging rebalance that is both annoying and worrying. Is there a way to reset this smoothly, or does it require a server restart?

hjm

-- Harry Mangalam - Research Computing, OIT, Rm 225 MSTB, UC Irvine [m/c 2225] / 92697 Google Voice Multiplexer: (949) 478-4487 415 South Circle View Dr, Irvine, CA, 92697 [shipping] MSTB Lat/Long: (33.642025,-117.844414) (paste into Google Maps)

___ Gluster-users mailing list Gluster-users@gluster.org http://gluster.org/cgi-bin/mailman/listinfo/gluster-users
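Absent a cleaner knob, the usual way out of a wedged 3.3 rebalance seems to be restarting the management daemons - a hedged sketch for these Ubuntu servers (restarting glusterd should not touch the brick glusterfsd processes or client mounts):

gluster volume rebalance gli stop       # even if status still claims 'in progress'
service glusterfs-server restart        # on each server; restarts glusterd only
gluster volume rebalance gli status     # should now show stopped / not started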
[Gluster-users] files on gluster brick that have '---------T' designation.
I have a working but unbalanced gluster config where one brick has about 2X the usage of the 3 others. I started a remove-brick to force a resolution of this problem (thanks to JD for the help!), but it's going very slowly - about 2.2MB/s over DDR IPoIB, or ~2.3 files/s. In investigating the problem, I may have found a partial explanation - I have found 100s of thousands (maybe millions) of zero-length files existing on the problem brick that do not exist in the client view, and that have the designation '---------T' via 'ls -l', ie:

/bducgl/alamng/Research/Yuki/newF20/runF20_2513/data:
total 0
---------T 2 root root 0 2012-08-04 11:23 backward_sm1003
---------T 2 root root 0 2012-08-04 11:23 backward_sm1007
---------T 2 root root 0 2012-08-04 11:23 backward_sm1029

I suspect that these are the ones that are responsible for the enormous expansion of the storage space on this brick and the very slow speed of the 'remove-brick' operation. Does this sound possible? Can I delete these files on the brick to resolve the imbalance? If not, is there a way to process them in some better way to rationalize the imbalance?

-- Harry Mangalam - Research Computing, OIT, Rm 225 MSTB, UC Irvine [m/c 2225] / 92697 Google Voice Multiplexer: (949) 478-4487 415 South Circle View Dr, Irvine, CA, 92697 [shipping] MSTB Lat/Long: (33.642025,-117.844414) (paste into Google Maps)

___ Gluster-users mailing list Gluster-users@gluster.org http://gluster.org/cgi-bin/mailman/listinfo/gluster-users
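Link files are identifiable on a brick by the sticky bit, the zero size, and a trusted.glusterfs.dht.linkto xattr naming the subvolume that holds the real data; a sketch for counting and inspecting them (run against the brick, not the client mount):

# count candidate link files on this brick
find /bducgl -type f -perm -1000 -size 0 | wc -l

# confirm that one really is a dht link file
getfattr -n trusted.glusterfs.dht.linkto \
    /bducgl/alamng/Research/Yuki/newF20/runF20_2513/data/backward_sm1003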
Re: [Gluster-users] files on gluster brick that have '---------T' designation.
Hi Shishir, Thanks for your attention. Hmm - your explanation makes some sense, but those 'T' files don't show up in the client view of the dir - only in the brick view. Is that valid?

I'm using 3.3 on 4 ubuntu 12.04 servers over DDR IPoIB, and the command to initiate the remove-brick was:

$ gluster volume remove-brick gli pbs3ib:/bducgl start

and the current status is:

$ gluster volume remove-brick gli pbs3ib:/bducgl status
Node       Rebalanced-files  size          scanned  failures  status
---------  ----------------  ------------  -------  --------  -----------
localhost  0                 0             137702   21406     stopped
pbs2ib     0                 0             168991   6921      stopped
pbs3ib     724683            594890945282  4402804  0         in progress
pbs4ib     0                 0             169081   7923      stopped

(the failures were the same as were seen when I tried the rebalance command previously).

Best, harry

On Mon, Aug 20, 2012 at 7:09 PM, Shishir Gowda sgo...@redhat.com wrote:
Hi Harry, These are valid files in glusterfs-dht xlator configured volumes. These are known as link files, which dht uses to maintain files on the hashed subvol when the actual data resides in non-hashed subvolumes (renames can lead to these). The cleanup of these files will be taken care of by running rebalance. Can you please provide the gluster version you are using, and the remove-brick command you used?

With regards, Shishir

-- Harry Mangalam - Research Computing, OIT, Rm 225 MSTB, UC Irvine [m/c 2225] / 92697 Google Voice Multiplexer: (949) 478-4487 415 South Circle View Dr, Irvine, CA, 92697 [shipping] MSTB Lat/Long: (33.642025,-117.844414) (paste into Google Maps)

___ Gluster-users mailing list Gluster-users@gluster.org http://gluster.org/cgi-bin/mailman/listinfo/gluster-users
Re: [Gluster-users] files on gluster brick that have '---------T' designation.
Hi Shishir, Here's the 'df -h' of the appropriate filesystem on all 4 of the gluster servers. It has equilibrated a bit since the original post - pbs3 has decreased from 73% and the others have increased from about 29% - but still, slow.

pbs1:/dev/sdb    6.4T  2.0T  4.4T  32%  /bducgl
pbs2:/dev/md0    8.2T  2.9T  5.4T  35%  /bducgl
pbs3:/dev/md127  8.2T  5.3T  3.0T  65%  /bducgl
pbs4:/dev/sda    6.4T  2.2T  4.3T  34%  /bducgl

The 'errors-only' extract of the log (since the remove-brick was started) is here: http://moo.nac.uci.edu/~hjm/gluster/remove-brick_errors.log.gz (2707 lines) and the last 100 lines of the active log (gli-rebalance.log) are here: http://pastie.org/4559913 Thanks for your help. Harry

On Mon, Aug 20, 2012 at 7:42 PM, Shishir Gowda sgo...@redhat.com wrote: Hi Harry, That is correct, the files won't be seen on the client. Can you provide an output of these: 1. df of all exports 2. the remove-brick/rebalance (volname-rebalance.log) log (if large, just the failure messages and the tail of the file). With regards, Shishir
--
Harry Mangalam - Research Computing, OIT, Rm 225 MSTB, UC Irvine [m/c 2225] / 92697 Google Voice Multiplexer: (949) 478-4487 415 South Circle View Dr, Irvine, CA, 92697 [shipping] MSTB Lat/Long: (33.642025,-117.844414) (paste into Google Maps)
___
Gluster-users mailing list Gluster-users@gluster.org http://gluster.org/cgi-bin/mailman/listinfo/gluster-users
Re: [Gluster-users] files on gluster brick that have '---------T' designation.
All the bricks are 3.3, and all the bricks were started via starting glusterd on each of them and then peer-probing etc. The initial reason for starting this fix-layout/rebalance/remove-brick was a CPU overload on the pbs3 brick (load of 30 on a 4-CPU server) that was dramatically decreasing performance. I killed glusterd, restarted it, and checked with 'gluster peer status' that it had re-established connection; that implied that all 4 peers were connected. I didn't find out until later, via the 'gluster volume status VOLUME detail' command, that this was incorrect. So this peer has been hashed and thrashed somewhat (and miraculously is still serving files), but in the process has gone out of proper balance with the other peers.

It sounds like you're saying that this:

Node       Rebalanced-files          size   scanned  failures       status
---------  ----------------  ------------  --------  --------  -----------
localhost                 0             0    137702     21406      stopped
pbs2ib                    0             0    168991      6921      stopped
pbs3ib               724683  594890945282   4402804         0  in progress
pbs4ib                    0             0    169081      7923      stopped

implies that the other peers are not participating in the remove-brick? The change in storage across the servers implies that they are participating, just very slowly. On the other hand, the last of the errors stopped 2 days ago (there are no more errors in the last 350MB of the rebalance logs), which also implies that the rest of the files are being migrated, just very slowly. At any rate, if you've diagnosed the problem, what is the solution? A cluster-wide glusterd restart to sync the uuids? Or is there another way to re-identify them to each other? Best, Harry

On Mon, Aug 20, 2012 at 9:06 PM, Shishir Gowda sgo...@redhat.com wrote: Hi Harry, Are all the bricks from 3.3? Or did you start any of the bricks manually (not through gluster volume commands)? remove-brick/rebalance processes are started across all nodes (1 per node) of the volume. We use the node-uuid to distribute work across nodes, so migration is handled by all the nodes to which the data belongs. In your case, there are errors being reported that the node-uuid is not available. With regards, Shishir
--
Harry Mangalam - Research Computing, OIT, Rm 225 MSTB, UC Irvine [m/c 2225] / 92697 Google Voice Multiplexer: (949) 478-4487 415 South Circle View Dr, Irvine, CA, 92697 [shipping] MSTB Lat/Long: (33.642025,-117.844414) (paste into Google Maps)
___
Gluster-users mailing list Gluster-users@gluster.org http://gluster.org/cgi-bin/mailman/listinfo/gluster-users
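(A hedged way to eyeball the uuid bookkeeping Shishir describes, assuming glusterd's usual state directory: each server's own UUID is recorded in glusterd.info, and every peer it knows about is a file named by that peer's UUID - run on each server and compare:)

$ grep UUID /var/lib/glusterd/glusterd.info
$ ls /var/lib/glusterd/peers/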
Re: [Gluster-users] Problem mounting Gluster volume [3.3]
On Tue, Aug 14, 2012 at 7:56 AM, Paolo Di Tommaso paolo.ditomm...@gmail.com wrote: [2012-08-14 10:42:05.550471] E [socket.c:1715:socket_connect_finish] 0-glusterfs: connection to failed (No route to host)

ping uses ICMP whereas the mount command uses TCP, and they can use different routing (as I was just taught by my network admins). Does ssh work? I believe that a gluster mount has to resolve by both forward and reverse DNS (like ssh... well, to work without spewing warnings). Try a traceroute to the server and see if its IP # resolves to the same host. Are the rest of the gluster hosts resolvable in the same way? Do you have old /etc/hosts entries that might interfere with the DNS resolution? hjm
--
Harry Mangalam - Research Computing, OIT, Rm 225 MSTB, UC Irvine [m/c 2225] / 92697 Google Voice Multiplexer: (949) 478-4487 415 South Circle View Dr, Irvine, CA, 92697 [shipping] MSTB Lat/Long: (33.642025,-117.844414) (paste into Google Maps)
___
Gluster-users mailing list Gluster-users@gluster.org http://gluster.org/cgi-bin/mailman/listinfo/gluster-users
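(A quick sketch of the forward/reverse check suggested above, using 'bs1' as a stand-in hostname:)

$ host bs1                                            # forward lookup
$ host $(host bs1 | awk '/has address/ {print $4}')   # reverse lookup should name the same host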
Re: [Gluster-users] 1/4 glusterfsd's runs amok; performance suffers;
Thanks for your comments. I use mdadm on many servers and I've seen md numbering like this a fair bit. Usually it occurs after another RAID has been created and the numbering shifts. Neil Brown (mdadm's author) seems to think it's fine, so I don't think that's the problem. And you're right - this is a Frankengluster made from a variety of chassis and controllers, and normally it's fine. As Brian noted, it's all the same to gluster, mod some small local differences in IO performance. Re the size difference, I'll explicitly rebalance the brick after the fix-layout finishes, but I'm even more worried about this fantastic increase in CPU usage and its effect on user performance. In the fix-layout routines (still running), I've seen CPU usage of glusterfsd rise to ~400% and loadavg go up to 15 on all the servers (except pbs3, the one that originally had that problem). That high load does not last long tho (maybe a few minutes) - we've just installed nagios on these nodes and I'm getting a ton of emails about load increasing and then decreasing on all the nodes (except pbs3). When the load goes very high on a server node, the user-end performance drops appreciably. hjm

On Sat, Aug 11, 2012 at 4:20 AM, Brian Candler b.cand...@pobox.com wrote: On Sat, Aug 11, 2012 at 12:11:39PM +0100, Nux! wrote: On 10.08.2012 22:16, Harry Mangalam wrote: pbs3:/dev/md127 8.2T 5.9T 2.3T 73% /bducgl <--- Harry, The name of that md device (127) indicates there may be something dodgy going on there. A device shouldn't be named 127 unless some problems occurred. Are you sure your drives are OK?

I have systems with /dev/md127 all the time, and there's no problem. It seems to number downwards from /dev/md127 - if I create another md array on the same system it is /dev/md126. However, this does suggest that the nodes are not configured identically: two are /dev/sda or /dev/sdb, which suggests either plain disk or hardware RAID, while two are /dev/md0 or /dev/md127, which is software RAID. Although this could explain performance differences between the nodes, this is transparent to gluster and doesn't explain why the files are unevenly balanced - unless there is one huge file which happens to have been allocated to this node. Regards, Brian.
--
Harry Mangalam - Research Computing, OIT, Rm 225 MSTB, UC Irvine [m/c 2225] / 92697 Google Voice Multiplexer: (949) 478-4487 415 South Circle View Dr, Irvine, CA, 92697 [shipping] MSTB Lat/Long: (33.642025,-117.844414) (paste into Google Maps)
___
Gluster-users mailing list Gluster-users@gluster.org http://gluster.org/cgi-bin/mailman/listinfo/gluster-users
Re: [Gluster-users] 1/4 glusterfsd's runs amok; performance suffers;
On Sat, Aug 11, 2012 at 9:41 AM, Brian Candler b.cand...@pobox.com wrote: On Sat, Aug 11, 2012 at 08:31:51AM -0700, Harry Mangalam wrote: Re the size difference, I'll explicitly rebalance the brick after the fix-layout finishes, but I'm even more worried about this fantastic increase in CPU usage and its effect on user performance.

This presumably means you were originally running the cluster with fewer nodes, and then added some later?

No, but the unbalanced current situation suggests that at some point, it got out of balance.

In the fix-layout routines (still running), I've seen CPU usage of glusterfsd rise to ~400% and loadavg go up to 15 on all the servers (except the pbs3, the one that originally had that problem). That high load does not last long tho (maybe a few minutes) - we've just installed nagios on these nodes and I'm getting a ton of emails about load increasing and then decreasing on all the nodes (except pbs3). When the load goes very high on a server node, the user-end performance drops appreciably.

Maybe worth trying an strace (strace -f -p <pid> 2>strace.out) on the glusterfsd process, or whatever it is which is causing the high load, during such a burst, just for a few seconds. The output might give some clues.

Good idea. I'll watch, and when it goes wacko I'll post the filtered results. Thanks Harry
--
Harry Mangalam - Research Computing, OIT, Rm 225 MSTB, UC Irvine [m/c 2225] / 92697 Google Voice Multiplexer: (949) 478-4487 415 South Circle View Dr, Irvine, CA, 92697 [shipping] MSTB Lat/Long: (33.642025,-117.844414) (paste into Google Maps)
___
Gluster-users mailing list Gluster-users@gluster.org http://gluster.org/cgi-bin/mailman/listinfo/gluster-users
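(One hedged way to do that sampling without drowning in output, assuming GNU strace and coreutils' timeout are available; -c prints a syscall-count summary instead of logging every call:)

$ timeout 10 strace -c -f -p "$(pgrep -o glusterfsd)" 2> strace-summary.txt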
[Gluster-users] 1/4 glusterfsd's runs amok; performance suffers;
running 3.3 distributed on IPoIB on 4 nodes, 1 brick per node. Any idea why, on one of those nodes, glusterfsd would go berserk, running up to 370% CPU and driving load to 30 (file performance on the clients slows to a crawl)? While very slow, it continued to serve out files. This is the second time this has happened in about a week. I had turned on the gluster nfs services, but wasn't using it when this happened. It's now off. kill -HUP did nothing to either glusterd or glusterfsd, so I had to kill both and restart glusterd. That solved the overload on glusterfsd and performance is back to near normal. I'm now doing a rebalance/fix-layout which is running as expected, but will take the weekend to complete. I did notice that the affected node (pbs3) has more files than the others, tho I'm not sure that this is significant.

Filesystem       Size  Used  Avail  Use%  Mounted on
pbs1:/dev/sdb    6.4T  1.9T   4.6T   29%  /bducgl
pbs2:/dev/md0    8.2T  2.4T   5.9T   30%  /bducgl
pbs3:/dev/md127  8.2T  5.9T   2.3T   73%  /bducgl  <---
pbs4:/dev/sda    6.4T  1.8T   4.6T   29%  /bducgl
--
Harry Mangalam - Research Computing, OIT, Rm 225 MSTB, UC Irvine [m/c 2225] / 92697 Google Voice Multiplexer: (949) 478-4487 415 South Circle View Dr, Irvine, CA, 92697 [shipping] MSTB Lat/Long: (33.642025,-117.844414) (paste into Google Maps)
___
Gluster-users mailing list Gluster-users@gluster.org http://gluster.org/cgi-bin/mailman/listinfo/gluster-users
Re: [Gluster-users] Unable to rebalance...status or stop after upgrade to 3.3
This sounds similar, tho not identical, to a problem that I had recently (described here: http://gluster.org/pipermail/gluster-users/2012-August/011054.html). My problems were the result of starting this kind of rebalance with a server node appearing to be connected (via the 'gluster peer status' output) but not actually being connected, as shown by the 'gluster volume status all detail' output. Note especially the part that describes its online state:

------------------------------------------------------------------------------
Brick            : Brick pbs3ib:/bducgl
Port             : 24018
Online           : N   <=====
Pid              : 20953
File System      : xfs

You may have already verified this, but what I did was to start a rebalance / fix-layout with a disconnected brick and it went ahead and tried to do it, unsuccessfully as you might guess. But when I finally was able to reconnect the downed brick and restart the rebalance, it (astonishingly) was able to bring everything back. So props to the gluster team. hjm

On Wed, Aug 8, 2012 at 11:58 AM, Dan Bretherton d.a.brether...@reading.ac.uk wrote: Hello All- I have noticed another problem after upgrading to version 3.3. I am unable to do "gluster volume rebalance VOLUME fix-layout status" or "...fix-layout ... stop" after starting a rebalance operation with "gluster volume rebalance VOLUME fix-layout start". The fix-layout operation seemed to be progressing normally on all the servers according to the log files, but all attempts to do status or stop result in the CLI usage message being returned. The only references to the rebalance commands in the log files were these, which all the servers seem to have one or more of.

[root@romulus glusterfs]# grep rebalance *.log
etc-glusterfs-glusterd.vol.log:[2012-08-08 12:49:04.870709] W [socket.c:1512:__socket_proto_state_machine] 0-management: reading from socket failed. Error (Transport endpoint is not connected), peer (/var/lib/glusterd/vols/tracks/rebalance/cb21050d-05c2-42b3-8660-230954bab324.sock)
tracks-rebalance.log:[2012-08-06 10:41:18.550241] I [graph.c:241:gf_add_cmdline_options] 0-tracks-dht: adding option 'rebalance-cmd' for volume 'tracks-dht' with value '4'

The volume name is tracks by the way. I wanted to stop the rebalance operation because it seemed to be causing a very high load on some of the servers and had been running for several days. I ended up having to manually kill the rebalance processes on all the servers followed by restarting glusterd. After that I found that one of the servers had rebalance_status=4 in file /var/lib/glusterd/vols/tracks/node_state.info, whereas all the others had rebalance_status=0. I manually changed the '4' to '0' and restarted glusterd. I don't know if this was a consequence of the way I had killed the rebalance operation or the cause of the strange behaviour. I don't really want to start another rebalance going to test because the last one was so disruptive. Has anyone else experienced this problem since upgrading to 3.3? Regards, Dan.
--
Harry Mangalam - Research Computing, OIT, Rm 225 MSTB, UC Irvine [m/c 2225] / 92697 Google Voice Multiplexer: (949) 478-4487 415 South Circle View Dr, Irvine, CA, 92697 [shipping] MSTB Lat/Long: (33.642025,-117.844414) (paste into Google Maps)
___
Gluster-users mailing list Gluster-users@gluster.org http://gluster.org/cgi-bin/mailman/listinfo/gluster-users
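(To see whether the servers agree, a hedged loop over the node_state.info files mentioned above - the host names are placeholders for your own servers, and 'tracks' is Dan's volume name:)

$ for h in server1 server2 server3 server4; do
    ssh "$h" grep rebalance_status /var/lib/glusterd/vols/tracks/node_state.info
  done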
Re: [Gluster-users] brick online or not? Don't trust 'gluster peer status'
As a final(?) follow-up to my problem: after restarting the rebalance with

gluster volume rebalance [vol-name] fix-layout start

it finished up last night after plowing thru the entirety of the filesystem - fixing about ~1M files (apparently ~2.2TB), all while the fs remained live (tho probably a bit slower than users would have liked). That's a strong '+' in the gluster column for resiliency. I started the rebalance without waiting for any advice to the contrary. 3.3 is supposed to have a built-in rebalance operator, but I saw no evidence of it. Other info from gluster.org suggested that it wouldn't do any harm to do this, so I went ahead and started it. Do the gluster wizards have any final words on this before I write this up in our trouble report? best wishes harry
--
Harry Mangalam - Research Computing, OIT, Rm 225 MSTB, UC Irvine [m/c 2225] / 92697 Google Voice Multiplexer: (949) 478-4487 415 South Circle View Dr, Irvine, CA, 92697 [shipping] MSTB Lat/Long: (33.642025,-117.844414) (paste into Google Maps)
___
Gluster-users mailing list Gluster-users@gluster.org http://gluster.org/cgi-bin/mailman/listinfo/gluster-users
Re: [Gluster-users] Gluster-users Digest, Vol 51, Issue 46
Hi Ben, Thanks for the expert advice.

On Fri, Aug 3, 2012 at 2:35 PM, Ben England bengl...@redhat.com wrote: 4. Re: kernel parameters for improving gluster writes on millions of small writes (long) (Harry Mangalam) Harry, You are correct, Glusterfs throughput with small write transfer sizes is a client-side problem, here are workarounds that at least some applications could use.

Not to be impertinent nor snarky, but why is the gluster client written in this way, and is that a high priority for fixing? It seems that caching/buffering is one of the great central truths of computer science in general. Is there a countering argument for not doing this?

1) NFS client is one workaround, since it buffers writes using the kernel buffer cache.

Yes, I tried this and I find the same thing. One thing I am unclear about tho is whether you can set up and run 1 NFS server per gluster server node. ie my glusterfs runs on 4 servers - could I connect clients to each one using a round-robin selection or other load/bandwidth-balancing approach? I've read opinions that seem to support both yes and no.

2) If your app does not have a configurable I/O size, but it lets you write to stdout, you can try piping your output to stdout and letting dd aggregate your I/O to the filesystem for you. In this example we triple single-thread write throughput for 4-KB I/O requests.

I agree again - I wrote this up for the gluster 'hints' http://goo.gl/NyMXO using gzip (other utilities seem to work as well, as do named pipes for handling more complex output options). [nice examples deleted]

3) If your program is written in C and it uses stdio.h, you can probably do a setvbuf() C RTL call to increase buffer size to something greater than 8 KB, which is the default in gcc-4.4. http://en.cppreference.com/w/c/io/setvbuf

Most of our users are not programmers, so this is not an option in most cases.
--
Harry Mangalam - Research Computing, OIT, Rm 225 MSTB, UC Irvine [m/c 2225] / 92697 Google Voice Multiplexer: (949) 478-4487 415 South Circle View Dr, Irvine, CA, 92697 [shipping] MSTB Lat/Long: (33.642025,-117.844414) (paste into Google Maps)
___
Gluster-users mailing list Gluster-users@gluster.org http://gluster.org/cgi-bin/mailman/listinfo/gluster-users
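(A concrete, hedged version of Ben's dd trick, using the burp.pl test script from later in this thread; obs sets dd's output block size, so it gathers the app's tiny writes into 1MB writes before they hit the gluster mount. Paths are illustrative.)

$ ./burp.pl 100 | dd obs=1M of=/gl/hmangala/burp.out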
Re: [Gluster-users] Change NFS parameters post-start
Thanks, Joe, for this (and other help on the IRC). Yes, I did check this, and no, it's not running. Harry

On Fri, Aug 3, 2012 at 4:26 PM, Joe Julian j...@julianfamily.org wrote: You also have to ensure that the kernel nfs server isn't running.
--
Harry Mangalam - Research Computing, OIT, Rm 225 MSTB, UC Irvine [m/c 2225] / 92697 Google Voice Multiplexer: (949) 478-4487 415 South Circle View Dr, Irvine, CA, 92697 [shipping] MSTB Lat/Long: (33.642025,-117.844414) (paste into Google Maps)
___
Gluster-users mailing list Gluster-users@gluster.org http://gluster.org/cgi-bin/mailman/listinfo/gluster-users
[Gluster-users] brick online or not? Don't trust 'gluster peer status'
Further to what I wrote before: gluster server overload; recovers, now Transport endpoint is not connected for some files http://goo.gl/CN6ud

I'm getting conflicting info here. On one hand, the peer that had its glusterfsd lock up seems to be in the gluster system, according to the frequently referenced 'gluster peer status':

Thu Aug 02 15:48:46 [1.00 0.89 0.92] root@pbs1:~
729 $ gluster peer status
Number of Peers: 3

Hostname: pbs4ib
Uuid: 2a593581-bf45-446c-8f7c-212c53297803
State: Peer in Cluster (Connected)

Hostname: pbs2ib
Uuid: 26de63bd-c5b7-48ba-b81d-5d77a533d077
State: Peer in Cluster (Connected)

Hostname: pbs3ib
Uuid: c79c4084-d6b9-4af9-b975-40dd6aa99b42
State: Peer in Cluster (Connected)

On the other hand, some errors that I provided yesterday:
===
[2012-08-01 18:07:26.104910] W [dht-selfheal.c:875:dht_selfheal_directory] 0-gli-dht: 1 subvolumes down -- not fixing
===
as well as this information:

$ gluster volume status all detail
[top 2 brick stanzas trimmed; they're online]
------------------------------------------------------------------------------
Brick            : Brick pbs3ib:/bducgl
Port             : 24018
Online           : N   <=====
Pid              : 20953
File System      : xfs
Device           : /dev/md127
Mount Options    : rw
Inode Size       : 256
Disk Space Free  : 6.1TB
Total Disk Space : 8.2TB
Inode Count      : 1758158080
Free Inodes      : 1752326373
------------------------------------------------------------------------------
Brick            : Brick pbs4ib:/bducgl
Port             : 24009
Online           : Y
Pid              : 20948
File System      : xfs
Device           : /dev/sda
Mount Options    : rw
Inode Size       : 256
Disk Space Free  : 4.6TB
Total Disk Space : 6.4TB
Inode Count      : 1367187392
Free Inodes      : 1361305613

The above implies fairly strongly that the brick did not re-establish connection to the volume, altho the gluster peer info did. Strangely enough, when I RE-restarted the glusterd, it DID come back and re-joined the gluster volume, and now the (restarted) fix-layout job is proceeding without those 'subvolumes down -- not fixing' errors, just a steady stream of 'found anomalies/fixing the layout' messages, tho at the rate that it's going it looks like it will take several days. Still better several days to fix the data on-disk and having the fs live than having to tell users that their data is gone and then having to rebuild from zero. Luckily, it's officially a /scratch filesystem. Harry
--
Harry Mangalam - Research Computing, OIT, Rm 225 MSTB, UC Irvine [m/c 2225] / 92697 Google Voice Multiplexer: (949) 478-4487 415 South Circle View Dr, Irvine, CA, 92697 [shipping] MSTB Lat/Long: (33.642025,-117.844414) (paste into Google Maps)
___
Gluster-users mailing list Gluster-users@gluster.org http://gluster.org/cgi-bin/mailman/listinfo/gluster-users
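(The moral in one line - check the bricks, not just the peers; a hedged one-liner over the same 'detail' output:)

$ gluster volume status all detail | grep -E 'Brick|Online'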
[Gluster-users] gluster server overload; recovers, now Transport endpoint is not connected for some files
/benchmarks/ALPBench/Face_Rec/data/EBGM_CSUNG
[2012-08-01 18:04:31.462275] I [dht-common.c:2337:dht_setxattr] 0-gli-dht: fixing the layout of /nlduong/benchmarks/ALPBench/Face_Rec/data/EBGM_CSU_FG
[2012-08-01 18:04:31.778421] I [dht-common.c:2337:dht_setxattr] 0-gli-dht: fixing the layout of /nlduong/benchmarks/ALPBench/Face_Rec/data/csuScrapShots
[2012-08-01 18:04:31.885009] I [dht-common.c:2337:dht_setxattr] 0-gli-dht: fixing the layout of /nlduong/benchmarks/ALPBench/Face_Rec/data/csuScrapShots/normSep2002sfi
[2012-08-01 18:04:32.337981] I [dht-common.c:2337:dht_setxattr] 0-gli-dht: fixing the layout of /nlduong/benchmarks/ALPBench/Face_Rec/data/csuScrapShots/source
[2012-08-01 18:04:32.441383] I [dht-common.c:2337:dht_setxattr] 0-gli-dht: fixing the layout of /nlduong/benchmarks/ALPBench/Face_Rec/data/csuScrapShots/source/pgm
[2012-08-01 18:04:32.558827] I [dht-common.c:2337:dht_setxattr] 0-gli-dht: fixing the layout of /nlduong/benchmarks/ALPBench/Face_Rec/data/faceGraphsWiskott
[2012-08-01 18:04:32.617823] I [dht-common.c:2337:dht_setxattr] 0-gli-dht: fixing the layout of /nlduong/benchmarks/ALPBench/Face_Rec/data/novelGraphsWiskott

Unfortunately, I'm also seeing this:

[2012-08-01 18:07:26.104859] I [dht-layout.c:593:dht_layout_normalize] 0-gli-dht: found anomalies in /nlduong/benchmarks/SPEC2K6-org/benchspec/CPU2006/403.gcc/data/test/input. holes=1 overlaps=0
[2012-08-01 18:07:26.104910] W [dht-selfheal.c:875:dht_selfheal_directory] 0-gli-dht: 1 subvolumes down -- not fixing
[2012-08-01 18:07:26.104996] I [dht-common.c:2337:dht_setxattr] 0-gli-dht: fixing the layout of /nlduong/benchmarks/SPEC2K6-org/benchspec/CPU2006/403.gcc/data/test/input
[2012-08-01 18:07:26.189403] I [dht-layout.c:593:dht_layout_normalize] 0-gli-dht: found anomalies in /nlduong/benchmarks/SPEC2K6-org/benchspec/CPU2006/403.gcc/data/test/output. holes=1 overlaps=0
[2012-08-01 18:07:26.189457] W [dht-selfheal.c:875:dht_selfheal_directory] 0-gli-dht: 1 subvolumes down -- not fixing

which implies that some of the errors are not fixable. Is there a best-practices solution for this problem? I suspect this is one of the most common problems to affect an operating gluster fs. hjm
--
Harry Mangalam - Research Computing, OIT, Rm 225 MSTB, UC Irvine [m/c 2225] / 92697 Google Voice Multiplexer: (949) 478-4487 415 South Circle View Dr, Irvine, CA, 92697 [shipping] MSTB Lat/Long: (33.642025,-117.844414) (paste into Google Maps)
___
Gluster-users mailing list Gluster-users@gluster.org http://gluster.org/cgi-bin/mailman/listinfo/gluster-users
[Gluster-users] Change NFS parameters post-start
In trying to convert clients from using the gluster native client to an NFS client, I'm trying to get the gluster volume mounted on a test mount point on the same client where the native client has mounted the volume. The client refuses with the error:

mount -t nfs bs1:/gl /mnt/glnfs
mount: bs1:/gl failed, reason given by server: No such file or directory

In looking at the gluster nfs.log, it looks like the nfs volume was mounted RDMA-only, which is odd, seeing that 3.3-1 does not fully support RDMA: the nfs.log is 99% these failure messages:

E [rdma.c:4458:tcp_connect_finish] 0-gl-client-2: tcp connect to etc

but the few messages that aren't these reveal that:

1: volume gl-client-0
2:   type protocol/client
3:   option remote-host bs2
4:   option remote-subvolume /raid1
5:   option transport-type rdma   <-
6:   option username a2994eef-60d6-4609-a6d1-8d760cf82424
7:   option password bbf8e05d-6ada-4371-99d0-09b4c55cc899
8: end-volume

The volume was created tcp,rdma (before I realized that rdma was temporarily deprecated):

Volume Name: gl
Type: Distribute
Volume ID: 21f480f7-fc5a-4fd8-a084-3964634a9332
Status: Started
Number of Bricks: 8
Transport-type: tcp,rdma
Bricks:
Brick1: bs2:/raid1
Brick2: bs2:/raid2
Brick3: bs3:/raid1
Brick4: bs3:/raid2
Brick5: bs4:/raid1
Brick6: bs4:/raid2
Brick7: bs1:/raid1
Brick8: bs1:/raid2
Options Reconfigured:
performance.write-behind-window-size: 1024MB
performance.flush-behind: on
performance.cache-size: 268435456
nfs.disable: off
performance.io-cache: on
performance.quick-read: on
performance.io-thread-count: 64
auth.allow: 10.2.*.*,10.1.*.*

and the gluster clients talk to it just fine over IPoIB. But the NFS client apparently insists on trying to use RDMA, which isn't being used. I didn't originally ask for or want the NFS subsystem and later turned it off (nfs.disable = on), but now I want to use it, and I'd like to be able to tell it to use sockets/TCP. Is there a way to do this after the fact? That is, without destroying and re-creating the current volume as tcp-only. I have another gluster FS where the transport type is set to tcp and it's working fine under NFS:

1: volume gli-client-0
2:   type protocol/client
3:   option remote-host pbs1ib
4:   option remote-subvolume /bducgl
5:   option transport-type tcp
6:   option username c173a866-a561-4da9-b977-93f8df4766a1
7:   option password 09480722-0b0f-4b41-bc73-9970fe129d27
8: end-volume

hjm
--
Harry Mangalam - Research Computing, OIT, Rm 225 MSTB, UC Irvine [m/c 2225] / 92697 Google Voice Multiplexer: (949) 478-4487 415 South Circle View Dr, Irvine, CA, 92697 [shipping] MSTB Lat/Long: (33.642025,-117.844414) (paste into Google Maps)
___
Gluster-users mailing list Gluster-users@gluster.org http://gluster.org/cgi-bin/mailman/listinfo/gluster-users
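(One hedged thing to try from the client side - forcing both the mount protocol and the transfer protocol to TCP, along the lines of the options I use for the gli volume elsewhere in this thread; note this only pins the client-to-NFS-server leg, not whatever transport the gluster nfs server uses toward the bricks:)

$ mount -t nfs -o mountproto=tcp,proto=tcp,vers=3 bs1:/gl /mnt/glnfs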
Re: [Gluster-users] kernel parameters for improving gluster writes on millions of small writes (long)
Hi Bryan, thanks for the suggestion. In fact, we're using XFS for the underlying filesystem (under 3ware controllers) and it was tuned (or at least I thought it was) for large files. We do get decent perf on large-file reads and writes as long as the writes are fairly large. I'll post my controller and XFS settings to see if they seem odd. I was experimenting more last night after the latest revelations and discovered some more things that may be illuminating. I wrote 2 tiny perl scripts: one that wrote ~400MB in short writes (burp) and another (bigburp) that created the same-sized string in-memory and then wrote it in one write. The burp script took about 4 times as long to write to a file on a gluster fs (and sync) as did the bigburp script. So, if the writes are as I described previously (individual writes of ~100 bytes), the performance is very poor (and the gluster process is driven very high - 100% for several seconds, I'm assuming due to queued instructions). If the same amount of data is written in a single write, the performance is pretty good, and while the gluster process goes high, it doesn't exceed about 60% and it lasts only a few sec. Why should this be? Why should Linux file caching care if the data to be written is the result of a single write or the result of lots of writes (other than the function call overhead - would that explain it?)? I can test that with oprofile, but it doesn't explain why the gluster process takes so much longer to process one than the other. From its POV, it should just be data, regardless of where it came from. Or am I missing some critical point? If it does matter that it's not just the size of the files but the way they are created that has a large effect on gluster write performance, then gluster (or at least the native gluster client) will not be appropriate for a lot of bioinformatics apps, many of which use this kind of write profile. hjm

On Thu, Jul 26, 2012 at 6:23 AM, Washer, Bryan bwas...@netsuite.com wrote: Harry, Just a question, but what file system are you using under the gluster system? You may need to tune that before you continue to try and tune the output system. I found that by using the xfs file system and tuning it for very large files I was able to improve my performance quite a bit. In this case though I was working with a lot of big files, so my tuning would not help you... but just wanted to make sure you had looked at this detail in your setup. Bryan
--
Harry Mangalam - Research Computing, OIT, Rm 225 MSTB, UC Irvine [m/c 2225] / 92697 Google Voice Multiplexer: (949) 478-4487 415 South Circle View Dr, Irvine, CA, 92697 [shipping] MSTB Lat/Long: (33.642025,-117.844414) (paste into Google Maps)
___
Gluster-users mailing list Gluster-users@gluster.org http://gluster.org/cgi-bin/mailman/listinfo/gluster-users
Re: [Gluster-users] kernel parameters for improving gluster writes on millions of small writes (long)
I had not, tho I had searched for something like this for a good bit yesterday(?!) Back to google class for me. Thanks very much! hjm

On Thu, Jul 26, 2012 at 8:07 AM, John Mark Walker johnm...@redhat.com wrote: Harry, Have you seen this post? http://community.gluster.org/a/linux-kernel-tuning-for-glusterfs/ Be sure and read all the comments, as Ben England chimes in on the comments, and he's one of the performance engineers at Red Hat. -JM

----- Harry Mangalam hjmanga...@gmail.com wrote: This is a continuation of my previous posts about improving write perf when trapping millions of small writes to a gluster filesystem. I was able to improve write perf by ~30x by running STDOUT thru gzip to consolidate and reduce the output stream. Today, another similar problem, having to do with yet another bioinformatics program (which these days typically handle the 'short reads' that come out of the majority of sequencing hardware, each read being 30-150 characters, with some metadata, typically in an ASCII file containing millions of such entries). Reading them doesn't seem to be a problem (at least on our systems) but writing them is quite awful. The program is called 'art_illumina' from the Broad Inst's 'ALLPATHS' suite and it generates an artificial Illumina data set from an input genome - in this case about 5GB of the type of data described above. Like before, the gluster process goes to 100% and the program itself slows to ~20-30% of a CPU. In this case, the app's output cannot be externally trapped by redirecting thru gzip since the output flag specifies the base filename for 2 files that are created internally and then written directly. This prevents even setting up a named pipe to trap and process the output. Since this gluster storage was set up specifically for bioinformatics, this is a repeating problem, and while some of the issues can be dealt with by trapping and converting output, it would be VERY NICE if we could deal with it at the OS level. The gluster volume is running over IPoIB on QDR IB and looks like this:

Volume Name: gl
Type: Distribute
Volume ID: 21f480f7-fc5a-4fd8-a084-3964634a9332
Status: Started
Number of Bricks: 8
Transport-type: tcp,rdma
Bricks:
Brick1: bs2:/raid1
Brick2: bs2:/raid2
Brick3: bs3:/raid1
Brick4: bs3:/raid2
Brick5: bs4:/raid1
Brick6: bs4:/raid2
Brick7: bs1:/raid1
Brick8: bs1:/raid2
Options Reconfigured:
performance.write-behind-window-size: 1024MB
performance.flush-behind: on
performance.cache-size: 268435456
nfs.disable: on
performance.io-cache: on
performance.quick-read: on
performance.io-thread-count: 64
auth.allow: 10.2.*.*,10.1.*.*

I've tried to increase every caching option that might improve this kind of performance, but it doesn't seem to help. At this point, I'm wondering whether changing the client (or server) kernel parameters will help. The client's meminfo is:

cat /proc/meminfo
MemTotal:       529425924 kB
MemFree:        241833188 kB
Buffers:           355248 kB
Cached:         279699444 kB
SwapCached:             0 kB
Active:           2241580 kB
Inactive:       278287248 kB
Active(anon):      190988 kB
Inactive(anon):    287952 kB
Active(file):     2050592 kB
Inactive(file): 277999296 kB
Unevictable:        16856 kB
Mlocked:            16856 kB
SwapTotal:      563198732 kB
SwapFree:       563198732 kB
Dirty:               1656 kB
Writeback:              0 kB
AnonPages:         486876 kB
Mapped:             19808 kB
Shmem:                164 kB
Slab:             1475476 kB
SReclaimable:     1205944 kB
SUnreclaim:        269532 kB
KernelStack:         5928 kB
PageTables:         27312 kB
NFS_Unstable:           0 kB
Bounce:                 0 kB
WritebackTmp:           0 kB
CommitLimit:    827911692 kB
Committed_AS:      536852 kB
VmallocTotal:   34359738367 kB
VmallocUsed:      1227732 kB
VmallocChunk:   33888774404 kB
HardwareCorrupted:      0 kB
AnonHugePages:     376832 kB
HugePages_Total:        0
HugePages_Free:         0
HugePages_Rsvd:         0
HugePages_Surp:         0
Hugepagesize:        2048 kB
DirectMap4k:       201088 kB
DirectMap2M:     15509504 kB
DirectMap1G:    521142272 kB

and the server's meminfo is:

$ cat /proc/meminfo
MemTotal:        32861400 kB
MemFree:          1232172 kB
Buffers:            29116 kB
Cached:          30017272 kB
SwapCached:            44 kB
Active:          18840852 kB
Inactive:        11772428 kB
Active(anon):      492928 kB
Inactive(anon):     75264 kB
Active(file):    18347924 kB
Inactive(file): 11697164 kB
Unevictable:            0 kB
Mlocked:                0 kB
SwapTotal:       16382900 kB
SwapFree:        16382680 kB
Dirty:                  8 kB
Writeback:              0 kB
AnonPages:         566876 kB
Mapped:             14212 kB
Shmem:               1276 kB
Slab:              429164 kB
SReclaimable:      324752 kB
SUnreclaim:        104412 kB
KernelStack:         3528 kB
PageTables:         16956 kB
Re: [Gluster-users] kernel parameters for improving gluster writes on millions of small writes (long)
I read and am still digesting the kernel tuning parameters mentioned in John's link. There's another useful link that expands on some of the same points here: The Linux Page Cache and pdflush: Theory of Operation and Tuning for Write-Heavy Loads http://www.westnet.com/~gsmith/content/linux-pdflush.htm However, while I digest them, I have a few more observations: It's not that the server is slow; it's the gluster native client that is. So I'm not sure that increasing the perf of the server will help much at this point. I wrote a tiny script (burp.pl) that just emits lots of short strings to stdout, like the problem app that originated this discussion (and a colleague did the same with a C++ app), to verify. If I send stdout to my gluster fs via the native gluster client, I observe a steady stream of data at about 14MB/s (this is on a DDR/IPoIB cluster):

$ time `./burp.pl 100 > /gl/hmangala/burp.out; sync`

real    0m29.646s
user    0m17.830s
sys     0m2.000s

In this case, burp.pl is only getting about 70% of a CPU and the gluster process is getting ~40%. Here's the ifstat output for the IB channel (~1 entry/s). Note the continuous data-out rate of about 14MB/s (and the odd input rate of about 1MB/s).

ib1  KB/s in  KB/s out
        0.00      0.00
        0.00      0.00
      383.34   5200.51   <- burp starts
     1039.43  14243.11
     1031.59  14132.11
     1037.36  14223.32
     1044.20  14304.81
     1040.40  14288.45
     1037.78  14217.64
     1042.19  14306.66
     1036.54  14200.05
     1062.26  14699.87
     1072.64  14711.29
     1072.87  14694.52
     1065.18  14608.67
     1074.23  14711.32
     1073.26  14711.43
     1069.79  14672.60
     1066.66  14608.58
     1067.68  14647.14
     1074.16  14711.48
     1069.16  14651.39
     1077.19  14767.32
     1075.74  14736.75
     1068.77  14634.86
     1066.81  14625.90
     1063.89  14586.81
     1064.79  14608.46
     1065.37  14583.04
     1065.10  14604.44
     1063.86  14591.14
      388.41   5323.84   <- burp ends
        0.00      0.00
     ---------------------
    30460.65 417607.51   totals (30MB input vs 417MB output)

For the NFS-mounted channel (mount command: mount -o mountproto=tcp,vers=3,noatime,auto -t nfs pbs1ib:/gli /mnt/glnfs):

$ time `./burp.pl 100 > /mnt/glnfs/hmangala/burp.out; sync`

real    0m24.704s   <- a little faster
user    0m20.710s
sys     0m0.810s

In this case burp.pl gets 100% of a CPU; gluster isn't involved and so doesn't register. Here's the ifstat output for the IB channel. Note the complete lack of input and no data output until the very end, when it bursts at ~140MB/s.

ib1  KB/s in  KB/s out
        0.00      0.00
        0.73      0.00   <- burp starts
        0.18      0.00
        0.00      0.00
        1.33      1.88
        0.00      0.00
        0.00      0.00
        0.04      0.00
        0.04      0.00
        0.00      0.00
        0.00      0.00
        0.00      0.00
        0.00      0.00
        0.00      0.00
        0.00      0.00
        0.00      0.00
        0.00      0.00
        0.04      1.29
        0.00      0.00
        0.00      0.00
        0.00      0.00
        0.00      0.00
      314.08  83002.70
      517.10  142239.6
      513.94  141469.4
      123.96  33219.89   <- burp ends
        0.04      0.00
     ---------------------
     1471.44 399934.76   Totals (1.47MB input vs 400MB output)

It's hard to argue with that. NFS is clearly superior / more efficient on a single process and may be more efficient overall for the use cases on our clusters. So why doesn't the gluster native client do client-side caching like NFS? It looks like it's explicitly refusing to be cached by the usual (and usually excellent) Linux mechanisms. What's the reason for declining this OS advantage on the client side while providing such a technically sweet solution on the server side? I'm at a loss to explain this behavior to our technical group.

[previous deleted]
--
Harry Mangalam - Research Computing, OIT, Rm 225 MSTB, UC Irvine [m/c 2225] / 92697 Google Voice Multiplexer: (949) 478-4487 415 South Circle View Dr, Irvine, CA, 92697 [shipping] MSTB Lat/Long: (33.642025,-117.844414) (paste into Google Maps)
___
Gluster-users mailing list Gluster-users@gluster.org http://gluster.org/cgi-bin/mailman/listinfo/gluster-users
[Gluster-users] temp fix: Simultaneous reads and writes from specific apps to IPoIB volume seem to conflict and kill performance.
The problem described in the subject appears NOT to be the case. It's not that simultaneous reads and writes dramatically decrease perf, but that the type of /writes/ being done by this app (bedtools) kills performance. If this was a self-writ app or an infrequently used one, I wouldn't bother writing this up, but bedtools is a fairly popular genomics app, and since many installations use gluster to host Next-Gen sequencing data and analysis, I thought I'd follow up on my own post.

The short version:
==================
Insert gzip to compress and stream the data before sending it to the gluster fs. The improvement in IO (and application) performance is dramatic. ie (all files on a gluster fs):

genomeCoverageBed -ibam RS_11261.bam -g \
  ref/dmel-all-chromosome-r5.1.fasta -d | gzip > output.cov.gz

Inserting the '| gzip' increased the app speed by more than 30X (relative to not using it on a gluster fs; however, it even improved the wall-clock speed of the app relative to running on a local filesystem by about 1/3), decreased the gluster CPU utilization by ~99%, and reduced the output size by 80%. So, wins all round.

The long version:
==================
The type of writes that bedtools does is also fairly common - lots of writes of tiny amounts of data. As I understand it (which may be wrong; please correct), the gluster native client (which we're using) does not buffer IO as well as the NFS client, which is why we frequently see complaints about gluster vs NFS perf. The apparent problem for bedtools is that these zillions of tiny writes are being handled separately, or at least not cached well enough to be consolidated into large writes. To present the data to gluster as a continuous stream instead of these tiny writes, they have to be 'converted' to such a stream. gzip is a nice solution because it compresses as it converts. Apparently anything that takes STDIN, buffers it appropriately, and then spits it out on STDOUT will work. Even piping the data thru 'cat' will work to allow bedtools to continue to run at 100%, tho it will increase the gluster CPU utilization to 90%. 'cat' of course uses less CPU (~14%) while gzip will use more (~60%), tho decreasing gluster's use enormously. I did try the performance options I mentioned earlier:

performance.write-behind-window-size: 1024MB
performance.flush-behind: on

They did not seem to help at all, and I'd still like an explanation of what they're supposed to do. The upshot is that this seems like, if not a bug, then at least an opportunity to improve gluster performance considerably.
--
Harry Mangalam - Research Computing, OIT, Rm 225 MSTB, UC Irvine [m/c 2225] / 92697 Google Voice Multiplexer: (949) 478-4487 415 South Circle View Dr, Irvine, CA, 92697 [shipping] MSTB Lat/Long: (33.642025,-117.844414) (paste into Google Maps)
___
Gluster-users mailing list Gluster-users@gluster.org http://gluster.org/cgi-bin/mailman/listinfo/gluster-users
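(For tools that insist on writing to a file path rather than stdout, the named-pipe variant mentioned elsewhere in this thread can sometimes substitute - a hedged sketch, where 'someapp -o' is a stand-in for such a tool and the paths are illustrative; note it won't help apps like art_illumina that derive several output names from one base:)

$ mkfifo /tmp/cov.pipe
$ gzip < /tmp/cov.pipe > output.cov.gz &    # reader compresses and streams to the gluster fs
$ someapp -o /tmp/cov.pipe input.bam        # 'someapp -o' is hypothetical
$ wait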
[Gluster-users] Simultaneous reads and writes from specific apps to IPoIB volume seem to conflict and kill performance.
I have a fairly new gluster fs of 4 nodes with 2 RAID6 bricks on each node, connected to a cluster via IPoIB on QDR IB. The servers are all SL6.2, running gluster 3.3-1; the clients are running the gluster-released glusterfs-fuse-3.3.0qa42-1 and glusterfs-3.3.0qa42-1. The volume seems normal:

$ gluster volume info
Volume Name: gl
Type: Distribute
Volume ID: 21f480f7-fc5a-4fd8-a084-3964634a9332
Status: Started
Number of Bricks: 8
Transport-type: tcp,rdma
Bricks:
Brick1: bs2:/raid1
Brick2: bs2:/raid2
Brick3: bs3:/raid1
Brick4: bs3:/raid2
Brick5: bs4:/raid1
Brick6: bs4:/raid2
Brick7: bs1:/raid1
Brick8: bs1:/raid2
Options Reconfigured:
performance.cache-size: 268435456
nfs.disable: on
performance.io-cache: on
performance.quick-read: on
performance.io-thread-count: 64
auth.allow: 10.2.*.*,10.1.*.*

The logs on both the server and client are remarkable in their lack of anything amiss (the server has the previously reported zillion-times-repeating string of:

I [socket.c:1798:socket_event_handler] 0-transport: disconnecting now

which seems to be correlated with turning the NFS server off. This has been mentioned before.) The gluster volume log, stripped of that line, is here: http://pastie.org/4309225 Individual large-file reads and writes are in the 300MB/s range, which is not magnificent but tolerable. However, we've recently detected what appears to be a conflict in reading and writing for some applications. When some applications are reading and writing to the gluster fs, the client /usr/sbin/glusterfs increases its CPU consumption to 100% and the IO goes to almost zero. When the inputs are on the gluster fs and the output is on another fs, performance is as good as on a local RAID. This seems to be specific to a particular application (bedtools, perhaps some other openmp genomics apps - still checking). Other utilities (cp, perl, tar, and other utilities) that read and write to the gluster filesystem seem to be able to push and pull fairly large amounts of data to/from it. The client is running a genomics utility (bedtools) which reads very large chunks of data from the gluster fs, then aligns it to a reference genome. Stracing the run yields this stanza, after which it hangs until I kill it. The user has said that it does complete, but at a speed hundreds of times slower (maybe timing out at each step..?)

open("/data/users/tdlong/bin/genomeCoverageBed", O_RDONLY) = 3
ioctl(3, SNDCTL_TMR_TIMEBASE or TCGETS, 0x7fffcf0e5bb0) = -1 ENOTTY (Inappropriate ioctl for device)
lseek(3, 0, SEEK_CUR)                   = 0
read(3, "#!/bin/sh\n${0%/*}/bedtools genom"..., 80) = 42
lseek(3, 0, SEEK_SET)                   = 0
getrlimit(RLIMIT_NOFILE, {rlim_cur=4*1024, rlim_max=4*1024}) = 0
dup2(3, 255)                            = 255
close(3)                                = 0
fcntl(255, F_SETFD, FD_CLOEXEC)         = 0
fcntl(255, F_GETFL)                     = 0x8000 (flags O_RDONLY|O_LARGEFILE)
fstat(255, {st_mode=S_IFREG|0755, st_size=42, ...}) = 0
lseek(255, 0, SEEK_CUR)                 = 0
rt_sigprocmask(SIG_BLOCK, NULL, [], 8)  = 0
read(255, "#!/bin/sh\n${0%/*}/bedtools genom"..., 42) = 42
rt_sigprocmask(SIG_BLOCK, NULL, [], 8)  = 0
rt_sigprocmask(SIG_BLOCK, [INT CHLD], [], 8) = 0
clone(child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x2ae9318729e0) = 8229
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
rt_sigprocmask(SIG_BLOCK, [CHLD], [], 8) = 0
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
rt_sigprocmask(SIG_BLOCK, [CHLD], [], 8) = 0
rt_sigaction(SIGINT, {0x436f40, [], SA_RESTORER, 0x3cb64302d0}, {SIG_DFL, [], SA_RESTORER, 0x3cb64302d0}, 8) = 0
wait4(-1,

Does this indicate any optional tuning or operational parameters that we should be using? hjm
--
Harry Mangalam - Research Computing, OIT, Rm 225 MSTB, UC Irvine [m/c 2225] / 92697 Google Voice Multiplexer: (949) 478-4487 415 South Circle View Dr, Irvine, CA, 92697 [shipping] MSTB Lat/Long: (33.642025,-117.844414) (paste into Google Maps)
___
Gluster-users mailing list Gluster-users@gluster.org http://gluster.org/cgi-bin/mailman/listinfo/gluster-users
Re: [Gluster-users] Simultaneous reads and writes from specific apps to IPoIB volume seem to conflict and kill performance.
Some more info.. I think the problem is the way the bedtools is writing the output - it's not getting buffered correctly. Using some more useful strace flags to force strace into the fork'ed child, you can see that the output is being written, just very slowly due to the awful, horrible, skeezy, skanky, lazy, wanky way that biologists (me included) tend to write code. ie: after the data is read in and processed, you get gigantic amounts of this kind of output being written to the file [pid 17021] 21:56:21 write(1, U\t137095\t43\n, 12) = 12 0.000120 [pid 17021] 21:56:21 write(1, U\t137096\t40\n, 12) = 12 0.000119 [pid 17021] 21:56:21 write(1, U\t137097\t40\n, 12) = 12 0.000119 [pid 17021] 21:56:21 write(1, U\t137098\t40\n, 12) = 12 0.000119 [pid 17021] 21:56:21 write(1, U\t137099\t38\n, 12) = 12 0.000116 [pid 17021] 21:56:21 write(1, U\t137100\t38\n, 12) = 12 0.000119 [pid 17021] 21:56:21 write(1, U\t137101\t38\n, 12) = 12 0.000117 ie (the file itself): ... 137098 U 137098 40 137099 U 137099 38 137100 U 137100 38 137101 U 137101 38 137102 U 137102 36 IT looks like the current gluster config isn't being set up to buffer this particular output correctly, so it's being written on a write-by-write basis. As noted below, my gluster performance options are: performance.cache-size: 268435456 performance.io-cache: on performance.quick-read: on performance.io-thread-count: 64 Is there an option to address this extremely slow write perf? These options (p 38 of the 'Gluster File System 3.3.0 Administration Guide') sound like they may help but without knowing what they actually do, I'm hesitant to apply them to what is now a live fs. performance.flush-behind: If this option is set ON, instructs write-behind translator to perform flush in background, by returning success (or any errors, if any of previous writes were failed) to application even before flush is sent to backend filesystem. performance.write-behind-window-size Size of the per-file write-behind buffer. Advice? hjm On Mon, Jul 23, 2012 at 4:59 PM, Harry Mangalam hjmanga...@gmail.com wrote: I have fairly new gluster fs of 4 nodes with 2 RAID6 bricks on each node connected to a cluster via IPoIB on QDR IB. The servers are all SL6.2, running gluster 3.3-1; the clients are running the gluster-released glusterfs-fuse-3.3.0qa42-1 glusterfs-3.3.0qa42-1. The volume seems normal: $ gluster volume info Volume Name: gl Type: Distribute Volume ID: 21f480f7-fc5a-4fd8-a084-3964634a9332 Status: Started Number of Bricks: 8 Transport-type: tcp,rdma Bricks: Brick1: bs2:/raid1 Brick2: bs2:/raid2 Brick3: bs3:/raid1 Brick4: bs3:/raid2 Brick5: bs4:/raid1 Brick6: bs4:/raid2 Brick7: bs1:/raid1 Brick8: bs1:/raid2 Options Reconfigured: performance.cache-size: 268435456 nfs.disable: on performance.io-cache: on performance.quick-read: on performance.io-thread-count: 64 auth.allow: 10.2.*.*,10.1.*.* The logs on both the server and client are remarkable in their lack of anything amiss (the server has the previously reported zillion times repeating string of: I [socket.c:1798:socket_event_handler] 0-transport: disconnecting now which seems to be correlated with turning the NFS server off. This has been mentioned before. The gluster volume log, stripped of that line is here: http://pastie.org/4309225 Individual large-file reads and writes are in the 300MB/s range which is not magnificent but tolerable. However, we've recently detected what appears to be a conflict in reading and writing for some applications. 
When some applications are reading and writing to the gluster fs, the client /usr/sbin/glusterfs increases its CPU consumption to 100% and the IO goes to almost zero. When the inputs are on the gluster fs and the output is on another fs, performance is as good as on a local RAID. This seems to be specific to a particular application (bedtools, perhaps some other openmp genomics apps - still checking). Other utilities (cp, perl, tar, etc.) that read and write to the gluster filesystem seem to be able to push and pull fairly large amounts of data to/from it.

The client is running a genomics utility (bedtools) which reads very large chunks of data from the gluster fs, then aligns it to a reference genome. Stracing the run yields this stanza, after which it hangs until I kill it. The user has said that it does complete, but at a speed hundreds of times slower (maybe timing out at each step..?)

open("/data/users/tdlong/bin/genomeCoverageBed", O_RDONLY) = 3
ioctl(3, SNDCTL_TMR_TIMEBASE or TCGETS, 0x7fffcf0e5bb0) = -1 ENOTTY (Inappropriate ioctl for device)
lseek(3, 0, SEEK_CUR) = 0
read(3, "#!/bin/sh\n${0%/*}/bedtools genom"..., 80) = 42
lseek(3, 0, SEEK_SET) = 0
getrlimit(RLIMIT_NOFILE, {rlim_cur
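For anyone wanting to try this, a minimal sketch of both angles. The two volume options are real gluster options (they are quoted from the 3.3 Admin Guide above), but the window-size value here is only an illustration, not a tested recommendation:

  gluster volume set gl performance.flush-behind on
  gluster volume set gl performance.write-behind-window-size 4MB

On the application side, the 12-byte per-line writes suggest unbuffered or line-buffered stdio; strace with fork-following and per-syscall timing exposes the pattern, and stdbuf can force block buffering - with the caveat that stdbuf only helps programs that use C stdio and don't set their own buffering:

  # follow forks, timestamp each call, and time each syscall
  strace -f -tt -T -o /tmp/bedtools.trace genomeCoverageBed ...
  # force ~1MB stdout buffering instead of write-per-line
  stdbuf -o1M genomeCoverageBed ... > output.cov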
Re: [Gluster-users] Distributed replicated volume
I think the strong consensus of gluster users would be that this is the wrong filesystem to use for what you propose. gluster has a lot to recommend it, but using it for mail or other ZOTfiles (Zillions Of Tiny files) will bring nothing but astonishingly poor performance, tearing of hair, rending of clothes, pain, and heartbreak. hjm

On Wed, Jul 18, 2012 at 12:43 PM, Gandalf Corvotempesta gandalf.corvotempe...@gmail.com wrote: Hi, i'm planning to develop a 4 node distributed replicated infrastructure with Gluster 3.3. I'll use 4 DELL R510s with 8x 2TB SATA disks each, giving us 32TB of raw redundant and distributed capacity. Some questions:
- in the near future we will add 2 other identical nodes; will it be possible to extend the gluster volume, going up to 32+16TB of raw capacity?
- What do you suggest, RAID5 or no-raid (one disk for each brick)?
Our primary use will be mail and web servers, so many small files. Any other advice? ___ Gluster-users mailing list Gluster-users@gluster.org http://gluster.org/cgi-bin/mailman/listinfo/gluster-users -- Harry Mangalam - Research Computing, OIT, Rm 225 MSTB, UC Irvine [m/c 2225] / 92697 Google Voice Multiplexer: (949) 478-4487 415 South Circle View Dr, Irvine, CA, 92697 [shipping] MSTB Lat/Long: (33.642025,-117.844414) (paste into Google Maps) ___ Gluster-users mailing list Gluster-users@gluster.org http://gluster.org/cgi-bin/mailman/listinfo/gluster-users
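On the expansion question, though: yes. For a distributed-replicated volume with replica 2, bricks are added in multiples of the replica count and the layout is then rebalanced. A sketch (VOLNAME, node5 and node6 are placeholders, not names from this thread):

  gluster volume add-brick VOLNAME node5:/brick node6:/brick
  gluster volume rebalance VOLNAME start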
Re: [Gluster-users] according to mtab, GlusterFS is already mounted
Do you have some dangling symlinks? /home -> /home2 (or vice versa), ie:

ls -ld /home*

What does 'mount' or /etc/mtab say? (assuming that the '_' are supposed to be spaces; if not, all bets are off) hjm

On Thu, Jul 5, 2012 at 1:41 AM, Jon Tegner teg...@renget.se wrote: Hi, I want to mount from two different gluster-filesystems, according to the following lines in fstab:

server1:glusterStore1___/home1__glusterfs___defaults,_netdev,transport=rdma___0_0
server2:/glusterStore2___/home2__glusterfs___defaults,_netdev,transport=rdma___0_0

However, when the second one is mounted I get the following message:

/sbin/mount.glusterfs: according to mtab, GlusterFS is already mounted on /home

However, both systems seem to be mounted OK. Am I doing something terribly wrong here, or can I just disregard this message? On server1 I'm using 3.2.3-1. On the client and server2 it's 3.2.6-1. Regards, /jon ___ Gluster-users mailing list Gluster-users@gluster.org http://gluster.org/cgi-bin/mailman/listinfo/gluster-users -- Harry Mangalam - Research Computing, OIT, Rm 225 MSTB, UC Irvine [m/c 2225] / 92697 Google Voice Multiplexer: (949) 478-4487 415 South Circle View Dr, Irvine, CA, 92697 [shipping] MSTB Lat/Long: (33.642025,-117.844414) (paste into Google Maps) ___ Gluster-users mailing list Gluster-users@gluster.org http://gluster.org/cgi-bin/mailman/listinfo/gluster-users
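A quick way to see exactly what mount.glusterfs is reacting to (sketch; mountpoints as in the fstab above):

  ls -ld /home /home1 /home2    # is any of these a symlink?
  grep -i gluster /etc/mtab     # what mtab actually records
  mount | grep -i gluster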
Re: [Gluster-users] Gluster-3.3 Puzzler
which OS are you using? I believe 3.3 will install but won't run on older CentOSs (5.7/5.8) due to libc skew. And you did 'modprobe fuse' before you tried to mount it...? hjm

On Wed, Jun 27, 2012 at 12:46 PM, Robin, Robin rob...@muohio.edu wrote: Hi, Just updated to Gluster-3.3; I can't seem to mount my initial test volume. I did the mount on the gluster server itself (which works on Gluster-3.2).

# rpm -qa | grep -i gluster
glusterfs-fuse-3.3.0-1.el6.x86_64
glusterfs-server-3.3.0-1.el6.x86_64
glusterfs-3.3.0-1.el6.x86_64

# gluster volume info all
Volume Name: vmvol
Type: Replicate
Volume ID: b105560a-e157-4b94-bac9-39378db6c6c9
Status: Started
Number of Bricks: 1 x 2 = 2
Transport-type: tcp
Bricks:
Brick1: mualglup01:/mnt/gluster/vmvol001
Brick2: mualglup02:/mnt/gluster/vmvol001
Options Reconfigured:
auth.allow: 127.0.0.1,134.53.*,10.*

# mount -t glusterfs mualglup01.mcs.muohio.edu:vmvol /mnt/test
(did this on the gluster machine itself)

I'm getting the following in the logs:
+--+
[2012-06-27 15:40:52.116160] I [rpc-clnt.c:1660:rpc_clnt_reconfig] 0-vmvol-client-0: changing port to 24009 (from 0)
[2012-06-27 15:40:52.116479] I [rpc-clnt.c:1660:rpc_clnt_reconfig] 0-vmvol-client-1: changing port to 24009 (from 0)
[2012-06-27 15:40:56.055124] I [client-handshake.c:1636:select_server_supported_programs] 0-vmvol-client-0: Using Program GlusterFS 3.3.0, Num (1298437), Version (330)
[2012-06-27 15:40:56.055575] I [client-handshake.c:1433:client_setvolume_cbk] 0-vmvol-client-0: Connected to 10.0.72.132:24009, attached to remote volume '/mnt/gluster/vmvol001'.
[2012-06-27 15:40:56.055610] I [client-handshake.c:1445:client_setvolume_cbk] 0-vmvol-client-0: Server and Client lk-version numbers are not same, reopening the fds
[2012-06-27 15:40:56.055682] I [afr-common.c:3627:afr_notify] 0-vmvol-replicate-0: Subvolume 'vmvol-client-0' came back up; going online.
[2012-06-27 15:40:56.055871] I [client-handshake.c:453:client_set_lk_version_cbk] 0-vmvol-client-0: Server lk version = 1
[2012-06-27 15:40:56.057871] I [client-handshake.c:1636:select_server_supported_programs] 0-vmvol-client-1: Using Program GlusterFS 3.3.0, Num (1298437), Version (330)
[2012-06-27 15:40:56.058277] I [client-handshake.c:1433:client_setvolume_cbk] 0-vmvol-client-1: Connected to 10.0.72.133:24009, attached to remote volume '/mnt/gluster/vmvol001'.
[2012-06-27 15:40:56.058304] I [client-handshake.c:1445:client_setvolume_cbk] 0-vmvol-client-1: Server and Client lk-version numbers are not same, reopening the fds
[2012-06-27 15:40:56.063514] I [fuse-bridge.c:4193:fuse_graph_setup] 0-fuse: switched to graph 0
[2012-06-27 15:40:56.063638] I [client-handshake.c:453:client_set_lk_version_cbk] 0-vmvol-client-1: Server lk version = 1
[2012-06-27 15:40:56.063802] I [fuse-bridge.c:4093:fuse_thread_proc] 0-fuse: unmounting /mnt/test
[2012-06-27 15:40:56.064207] W [glusterfsd.c:831:cleanup_and_exit] (--/lib64/libc.so.6(clone+0x6d) [0x35f0ce592d] (--/lib64/libpthread.so.0() [0x35f14077f1] (--/usr/sbin/glusterfs(glusterfs_sigwaiter+0xdd) [0x405cfd]))) 0-: received signum (15), shutting down
[2012-06-27 15:40:56.064250] I [fuse-bridge.c:4643:fini] 0-fuse: Unmounting '/mnt/test'.

The server and client should be the same version (as attested by the rpm). I've seen that some other people are getting the same errors in the archive; no solutions were offered. Any help is appreciated.
Thanks, Robin ___ Gluster-users mailing list Gluster-users@gluster.org http://gluster.org/cgi-bin/mailman/listinfo/gluster-users -- Harry Mangalam - Research Computing, OIT, Rm 225 MSTB, UC Irvine [m/c 2225] / 92697 Google Voice Multiplexer: (949) 478-4487 415 South Circle View Dr, Irvine, CA, 92697 [shipping] MSTB Lat/Long: (33.642025,-117.844414) (paste into Google Maps) ___ Gluster-users mailing list Gluster-users@gluster.org http://gluster.org/cgi-bin/mailman/listinfo/gluster-users
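A quick sanity check for the fuse prerequisite mentioned at the top of this thread (sketch; the mount command is Robin's own):

  modprobe fuse
  lsmod | grep fuse     # fuse module loaded?
  ls -l /dev/fuse       # device node present?
  mount -t glusterfs mualglup01.mcs.muohio.edu:vmvol /mnt/test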
Re: [Gluster-users] Fedora 17 GlusterFS 3.3.0 problmes
Were you automounting your glusterfs? This is extremely similar to what I described previously: http://comments.gmane.org/gmane.comp.file-systems.gluster.user/9241 In my case the servers were running SL6.2; clients were running CentOS 5.7 and were automounting the glusterfs with IPoIB. When we switched to hard mounts, the problem seems to have gone away, but I'd be interested in seeing this resolved. The "Client lk-version numbers are not same, reopening the fds" strings in the logs were also identical, even when I compiled new client versions that matched the server version. If this does not have a trivial answer, I'll be happy to file an official bug report. One report could be a twit :); a second one could be a bug. hjm

On Fri, Jun 22, 2012 at 7:49 AM, Nathan Stratton nat...@robotics.net wrote: When I do a NFS mount and do a ls I get:

[root@ovirt share]# ls
ls: reading directory .: Too many levels of symbolic links
[root@ovirt share]# ls -fl
ls: reading directory .: Too many levels of symbolic links
total 3636
drwxr-xr-x   3 root root 16384 Jun 21 19:34 .
dr-xr-xr-x. 21 root root  4096 Jun 21 19:29 ..
drwxr-xr-x   3 root root 16384 Jun 21 19:34 .
dr-xr-xr-x. 21 root root  4096 Jun 21 19:29 ..
drwxr-xr-x   3 root root 16384 Jun 21 19:34 .
dr-xr-xr-x. 21 root root  4096 Jun 21 19:29 ..
drwxr-xr-x   3 root root 16384 Jun 21 19:34 .
dr-xr-xr-x. 21 root root  4096 Jun 21 19:29 ..
drwxr-xr-x   3 root root 16384 Jun 21 19:34 .
{and so on}

When I try a fuse mount I get:

[client-handshake.c:1445:client_setvolume_cbk] 0-share-client-0: Server and Client lk-version numbers are not same, reopening the fds

I have tried the fedora 16 RPMs and also built new fedora 17 RPMs. Full Log:

[2012-06-21 19:24:35.633510] I [glusterfsd.c:1666:main] 0-/usr/sbin/glusterfs: Started running /usr/sbin/glusterfs version 3.3.0
[2012-06-21 19:24:35.646832] I [io-cache.c:1549:check_cache_size_ok] 0-share-quick-read: Max cache size is 16825450496
[2012-06-21 19:24:35.646916] I [io-cache.c:1549:check_cache_size_ok] 0-share-io-cache: Max cache size is 16825450496
[2012-06-21 19:24:35.660807] I [client.c:2142:notify] 0-share-client-0: parent translators are ready, attempting connect on transport
[2012-06-21 19:24:35.664026] I [client.c:2142:notify] 0-share-client-1: parent translators are ready, attempting connect on transport
[2012-06-21 19:24:35.666801] I [client.c:2142:notify] 0-share-client-2: parent translators are ready, attempting connect on transport
[2012-06-21 19:24:35.669385] I [client.c:2142:notify] 0-share-client-3: parent translators are ready, attempting connect on transport
[2012-06-21 19:24:35.671951] I [client.c:2142:notify] 0-share-client-4: parent translators are ready, attempting connect on transport
[2012-06-21 19:24:35.674514] I [client.c:2142:notify] 0-share-client-5: parent translators are ready, attempting connect on transport
[2012-06-21 19:24:35.677093] I [client.c:2142:notify] 0-share-client-6: parent translators are ready, attempting connect on transport
[2012-06-21 19:24:35.679652] I [client.c:2142:notify] 0-share-client-7: parent translators are ready, attempting connect on transport

Given volfile:
+--+
  1: volume share-client-0
  2:   type protocol/client
  3:   option remote-host virt01.casabi.net
  4:   option remote-subvolume /export
  5:   option transport-type tcp
  6:   option username f0a71b3a-1cb2-4a38-8620-f3853aa90798
  7:   option password 9641cac2-de93-4621-b94b-9320ada20fd7
  8: end-volume
  9:
 10: volume share-client-1
 11:   type protocol/client
 12:   option remote-host virt02.casabi.net
 13:   option remote-subvolume /export
 14:   option transport-type tcp
 15:   option username f0a71b3a-1cb2-4a38-8620-f3853aa90798
 16:   option password 9641cac2-de93-4621-b94b-9320ada20fd7
 17: end-volume
 18:
 19: volume share-client-2
 20:   type protocol/client
 21:   option remote-host virt03.casabi.net
 22:   option remote-subvolume /export
 23:   option transport-type tcp
 24:   option username f0a71b3a-1cb2-4a38-8620-f3853aa90798
 25:   option password 9641cac2-de93-4621-b94b-9320ada20fd7
 26: end-volume
 27:
 28: volume share-client-3
 29:   type protocol/client
 30:   option remote-host virt04.casabi.net
 31:   option remote-subvolume /export
 32:   option transport-type tcp
 33:   option username f0a71b3a-1cb2-4a38-8620-f3853aa90798
 34:   option password 9641cac2-de93-4621-b94b-9320ada20fd7
 35: end-volume
 36:
 37: volume share-client-4
 38:   type protocol/client
 39:   option remote-host virt05.casabi.net
 40:   option remote-subvolume /export
 41:   option transport-type tcp
 42:   option username f0a71b3a-1cb2-4a38-8620-f3853aa90798
 43:   option password 9641cac2-de93-4621-b94b-9320ada20fd7
 44: end-volume
 45:
 46: volume share-client-5
 47:   type
Re: [Gluster-users] shrinking volume
In 3.3, this is exactly what 'remove-brick' does. It migrates the data off an active volume, and when it's done, allows the removed brick to be upgraded, shut down, killed off, etc.

gluster volume remove-brick vol server:/brick start

(takes a while to start up, but then goes fairly rapidly.) The following is the result of a recent remove-brick I did with 3.3:

% gluster volume remove-brick gl bs1:/raid1 status
Node       Rebalanced-files  size  scanned  failures  status
localhost  2  10488397779  120  in progress
bs2        0  0  0  0  not started
bs3        0  0  0  0  not started
bs4        0  0  0  0  not started

(time passes)

$ gluster volume remove-brick gl bs1:/raid2 status
Node       Rebalanced-files  size  scanned  failures  status
localhost  952  26889337908  83060  completed

Note that once the 'status' says completed, you need to issue the remove-brick command again - the 'commit' form - to actually finalize the operation. And that 'remove-brick' command will not clear the dir structure on the removed brick.

On Thu, 2012-06-21 at 12:29 -0400, Brian Cipriano wrote: Hi all - is there a safe way to shrink an active gluster volume without losing files? I've used remove-brick before, but this causes the files on that brick to be removed from the volume. Which is fine for some situations. But I'm trying to remove a brick without losing files. This is because our file usage can grow dramatically over short periods. During those times we add a lot of buffer to our gluster volume, to keep it at about 50% usage. After things settle down and file usage isn't changing as much, we'd like to remove some bricks in order to keep usage at about 80%. (These bricks are AWS EBS volumes - we want to remove the bricks to save a little $ when things are slow.) So what I'd like to do is the following. This is a simple distributed volume, no replication.
* Let gluster know I want to remove a brick
* No new files will go to that brick
* Gluster starts copying files from that brick to other bricks, essentially rebalancing the data
* Once all files have been duplicated onto other bricks, the brick is marked as removed and I can do a normal remove-brick
* Over the course of this procedure the files are always available because there's always at least one active copy of every file
This procedure seems very similar to replace-brick, except the goal would be to evenly distribute to all other active bricks (without interfering with pre-existing files), not one new brick. Is there any way to do this? I *could* just do my remove-brick, then manually distribute the files from that old brick back onto the volume, but that would cause those files to become unavailable for some amount of time. Many thanks for all your help, - brian ___ Gluster-users mailing list Gluster-users@gluster.org http://gluster.org/cgi-bin/mailman/listinfo/gluster-users ___ Gluster-users mailing list Gluster-users@gluster.org http://gluster.org/cgi-bin/mailman/listinfo/gluster-users
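Condensed, the whole lifecycle looks like this (sketch, using the volume and brick names from the example above; the 'commit' spelling of the finalize step is per 3.3):

  gluster volume remove-brick gl bs1:/raid1 start
  gluster volume remove-brick gl bs1:/raid1 status   # repeat until it says 'completed'
  gluster volume remove-brick gl bs1:/raid1 commit   # finalize: brick leaves the volume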
[Gluster-users] How Fatal? Server and Client lk-version numbers are not same, reopening the fds
Despite Joe Landman's sage advice to the contrary, I'm trying to convince an IPoIB volume to service requests from a GbE client via some /etc/hosts manipulation. (This may or may not be related to the automount problems we're having as well.) This has worked (and continues to work) well on another cluster with a slightly older version of gluster (the 3.3.0qa42 version on both server and client). In the following case the servers are on IPoIB (net 10.2.x.x) and GbE (10.1.x.x) and can ping back and forth on their respective networks to all the clients and servers. The gluster volume was created using the IPoIB network and numbers:

Volume Name: gl
Type: Distribute
Volume ID: 21f480f7-fc5a-4fd8-a084-3964634a9332
Status: Started
Number of Bricks: 8
Transport-type: tcp,rdma
Bricks:
Brick1: bs2:/raid1
Brick2: bs2:/raid2
Brick3: bs3:/raid1
Brick4: bs3:/raid2
Brick5: bs4:/raid1
Brick6: bs4:/raid2
Brick7: bs1:/raid1
Brick8: bs1:/raid2
Options Reconfigured:
nfs.disable: on
performance.io-cache: on
performance.quick-read: on
performance.io-thread-count: 64
auth.allow: 10.2.*.*,10.1.*.*

using 3.3 servers on SciLi 6.2 and 3.3.0qa42 clients. When trying to mount the gluster volume on an ethernet client, coerced into believing that the server is on ethernet using /etc/hosts manipulations, it doesn't complete the mount, failing rapidly with the following log: http://pastie.org/4123348. The server log doesn't seem to show anything. There are repeated references to "Server and Client lk-version numbers are not same, reopening the fds". Is this a fatal error or a side effect? How important is having exactly matching versions? ___ Gluster-users mailing list Gluster-users@gluster.org http://gluster.org/cgi-bin/mailman/listinfo/gluster-users
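For concreteness, the /etc/hosts coercion on the GbE-only client looks like this (sketch - these 10.1.x.x addresses are made up for illustration; only bs1's IPoIB address appears elsewhere in these threads):

  # /etc/hosts on the ethernet client: resolve the server names to their
  # 10.1.x.x (GbE) addresses instead of the 10.2.x.x (IPoIB) ones
  10.1.7.11  bs1
  10.1.7.12  bs2
  10.1.7.13  bs3
  10.1.7.14  bs4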
Re: [Gluster-users] How Fatal? Server and Client lk-version numbers are not same, reopening the fds
To check whether the point-version skew might have an effect, I compiled gluster from the source: http://download.gluster.org/pub/gluster/glusterfs/LATEST/glusterfs-3.3.0.tar.gz and tried it again. However, even tho the server and client are now from (I assume) the same source, I still get that error:

[2012-06-20 17:13:52.084846] I [client-handshake.c:453:client_set_lk_version_cbk] 0-gl-client-6: Server lk version = 1
[2012-06-20 17:13:52.087352] I [client-handshake.c:1636:select_server_supported_programs] 0-gl-client-5: Using Program GlusterFS 3.3.0, Num (1298437), Version (330)

The server was installed from the CentOS binary of that version: http://download.gluster.org/pub/gluster/glusterfs/LATEST/CentOS/glusterfs-server-3.3.0-1.el6.x86_64.rpm - the client from the self-compiled code. ___ Gluster-users mailing list Gluster-users@gluster.org http://gluster.org/cgi-bin/mailman/listinfo/gluster-users
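For the record, the client build was nothing exotic - the stock autotools sequence (sketch; default prefix and no special configure flags assumed):

  tar xzf glusterfs-3.3.0.tar.gz
  cd glusterfs-3.3.0
  ./configure
  make && sudo make install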
[Gluster-users] Too many levels of symbolic links with glusterfs automounting
(Apologies if this already posted, but I recently had to change smtp servers, which scrambled some list permissions, and I haven't seen it post.)

I set up a 3.3 gluster volume for another sysadmin and he has added it to his cluster via automount. It seems to work initially, but after some time (days) he is now regularly seeing this warning when he tries to traverse the mounted filesystems:

$ df
df: `/share/gl': Too many levels of symbolic links

It's supposed to be mounted on /share/gl with a symlink to /gl, ie: /gl -> /share/gl

I've been using gluster with static mounts on a cluster and have never seen this behavior; google does not seem to record anyone else seeing this with gluster. However, I note that the Howto Automount GlusterFS page at http://www.gluster.org/community/documentation/index.php/Howto_Automount_GlusterFS has been deleted. Is automounting no longer supported?

His auto.master file is as follows (sorry for the wrapping):

w1 -rw,intr,bg,v3,rsize=16384,wsize=16384,retrans=10,timeo=20,hard,lock,defaults,noatime,async 10.1.50.2:/
w2 -rw,intr,bg,v3,rsize=16384,wsize=16384,retrans=10,timeo=20,hard,lock,defaults,noatime,async 10.1.50.3:/
mathbio -rw,intr,bg,v3,rsize=16384,wsize=16384,retrans=10,timeo=20,hard,lock,defaults,noatime,async 10.1.50.2:/
tw -rw,intr,bg,v3,rsize=16384,wsize=16384,retrans=10,timeo=20,hard,lock,defaults,noatime,async 10.1.50.4:/
shwstore -rw,intr,bg,v3,rsize=16384,wsize=16384,lock,defaults,noatime,async shwraid.biomol.uci.edu:/
djtstore -rw,intr,bg,v3,rsize=16384,wsize=16384,lock,defaults,noatime,async djtraid.biomol.uci.edu:/
djtstore2 -rw,intr,bg,v3,rsize=16384,wsize=16384,lock,defaults,noatime,async djtraid2.biomol.uci.edu:/djtraid2:/
djtstore3 -rw,intr,bg,v3,rsize=16384,wsize=16384,lock,defaults,noatime,async djtraid3.biomol.uci.edu:/djtraid3:/
kevin -rw,intr,bg,rsize=65520,wsize=65520,retrans=10,timeo=20,hard,lock,defaults,noatime,async 10.2.255.230:/
samlab -rw,intr,bg,rsize=65520,wsize=65520,retrans=10,timeo=20,hard,lock,defaults,noatime,async 10.2.255.237:/
new-data -rw,intr,bg,rsize=65520,wsize=65520,retrans=10,timeo=20,hard,lock,defaults,noatime,async nas-1-1.ib:/
gl -fstype=glusterfs bs1:/

He has never seen this behavior with the other automounted fs's. The system logs from the affected nodes do not have any gluster strings that appear to be relevant, but /var/log/glusterfs/share-gl.log ends with this series of odd lines:

[2012-06-18 08:57:38.964243] I [client-handshake.c:453:client_set_lk_version_cbk] 0-gl-client-6: Server lk version = 1
[2012-06-18 08:57:38.964507] I [fuse-bridge.c:3376:fuse_init] 0-glusterfs-fuse: FUSE inited with protocol versions: glusterfs 7.13 kernel 7.16
[2012-06-18 09:16:48.692701] W [client3_1-fops.c:2630:client3_1_lookup_cbk] 0-gl-client-4: remote operation failed: Stale NFS file handle. Path: /tdlong/RILseq/makebam.commands (90193380-d107-4b6c-b02f-ab53a0f65148)
[2012-06-18 09:16:48.693030] W [client3_1-fops.c:2630:client3_1_lookup_cbk] 0-gl-client-4: remote operation failed: Stale NFS file handle. Path: /tdlong/RILseq/makebam.commands (90193380-d107-4b6c-b02f-ab53a0f65148)
[2012-06-18 09:16:48.693165] W [client3_1-fops.c:2630:client3_1_lookup_cbk] 0-gl-client-4: remote operation failed: Stale NFS file handle. Path: /tdlong/RILseq/makebam.commands (90193380-d107-4b6c-b02f-ab53a0f65148)
[2012-06-18 09:16:48.693394] W [client3_1-fops.c:2630:client3_1_lookup_cbk] 0-gl-client-4: remote operation failed: Stale NFS file handle.
Path: /tdlong/RILseq/makebam.commands (90193380-d107-4b6c-b02f-ab53a0f65148)
[2012-06-18 10:56:32.756551] I [fuse-bridge.c:4037:fuse_thread_proc] 0-fuse: unmounting /share/gl
[2012-06-18 10:56:32.757148] W [glusterfsd.c:816:cleanup_and_exit] (--/lib64/libc.so.6(clone+0x6d) [0x3829ed44bd] (--/lib64/libpthread.so.0 [0x382aa0673d] (--/usr/sbin/glusterfs(glusterfs_sigwaiter+0x17c) [0x40524c]))) 0-: received signum (15), shutting down

Any hints as to why this is happening? ___ Gluster-users mailing list Gluster-users@gluster.org http://gluster.org/cgi-bin/mailman/listinfo/gluster-users
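If automount is the trigger, a static mount is the obvious control experiment; the hard-mount equivalent of the map entry above would be something like this (sketch - assumes the volume is named 'gl', which the map entry's bare 'bs1:/' doesn't actually spell out):

  # /etc/fstab on the client
  bs1:/gl  /share/gl  glusterfs  defaults,_netdev  0 0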
Re: [Gluster-users] Too many levels of symbolic links with glusterfs automounting
One client log file is here: http://goo.gl/FyYfy

On the server side, on bs1 and bs4, there is a huge, current nfs.log file (odd, since I neither wanted nor configured an nfs export). It is filled entirely with these lines:

tail -5 nfs.log
[2012-06-19 21:11:54.402567] E [rdma.c:4458:tcp_connect_finish] 0-gl-client-1: tcp connect to  failed (Connection refused)
[2012-06-19 21:11:54.406023] E [rdma.c:4458:tcp_connect_finish] 0-gl-client-2: tcp connect to  failed (Connection refused)
[2012-06-19 21:11:54.409486] E [rdma.c:4458:tcp_connect_finish] 0-gl-client-3: tcp connect to  failed (Connection refused)
[2012-06-19 21:11:54.412822] E [rdma.c:4458:tcp_connect_finish] 0-gl-client-6: tcp connect to 10.2.7.11:24008 failed (Connection refused)
[2012-06-19 21:11:54.416231] E [rdma.c:4458:tcp_connect_finish] 0-gl-client-7: tcp connect to 10.2.7.11:24008 failed (Connection refused)

on servers bs2 and bs3 there is a current, huge log of this line, repeating every 3s:

[2012-06-19 21:14:00.907387] I [socket.c:1798:socket_event_handler] 0-transport: disconnecting now

I was reminded as I was copying it that the client and servers are slightly different - the client is 3.3.0qa42-1 while the server is 3.3.0-1. Is this enough version skew to cause a difference? There are no other problems that I'm aware of, but if it's the case that a slight version skew will be problematic, I'll be careful to keep them exactly aligned. I think this was done since the final release binary did not support the glibc that we were using on the compute nodes and the 3.3.0qa42-1 did. Perhaps too sloppy...?

gluster volume info
Volume Name: gl
Type: Distribute
Volume ID: 21f480f7-fc5a-4fd8-a084-3964634a9332
Status: Started
Number of Bricks: 8
Transport-type: tcp,rdma
Bricks:
Brick1: bs2:/raid1
Brick2: bs2:/raid2
Brick3: bs3:/raid1
Brick4: bs3:/raid2
Brick5: bs4:/raid1
Brick6: bs4:/raid2
Brick7: bs1:/raid1
Brick8: bs1:/raid2
Options Reconfigured:
performance.io-cache: on
performance.quick-read: on
performance.io-thread-count: 64
auth.allow: 10.2.*.*,10.1.*.*

gluster volume status
Status of volume: gl
Gluster process             Port   Online  Pid
--
Brick bs2:/raid1            24009  Y  2908
Brick bs2:/raid2            24011  Y  2914
Brick bs3:/raid1            24009  Y  2860
Brick bs3:/raid2            24011  Y  2866
Brick bs4:/raid1            24009  Y  2992
Brick bs4:/raid2            24011  Y  2998
Brick bs1:/raid1            24013  Y  10122
Brick bs1:/raid2            24015  Y  10154
NFS Server on localhost     38467  Y  9475
NFS Server on 10.2.7.11     38467  Y  10160
NFS Server on bs2           38467  N  N/A
NFS Server on bs3           38467  N  N/A

Hmm, sure enough, bs1 and bs4 (localhost in the above info) appear to be running NFS servers, while bs2 and bs3 are not...? OK - after some googling, the gluster nfs service can be shut off with:

gluster volume set gl nfs.disable on

and now the status looks like this:

gluster volume status
Status of volume: gl
Gluster process             Port   Online  Pid
--
Brick bs2:/raid1            24009  Y  2908
Brick bs2:/raid2            24011  Y  2914
Brick bs3:/raid1            24009  Y  2860
Brick bs3:/raid2            24011  Y  2866
Brick bs4:/raid1            24009  Y  2992
Brick bs4:/raid2            24011  Y  2998
Brick bs1:/raid1            24013  Y  10122
Brick bs1:/raid2            24015  Y  10154

hjm

On Tue, 2012-06-19 at 13:05 -0700, Anand Avati wrote: Can you post the complete logs? Are the 'Too many levels of symbolic links' (or ELOOP) logs seen in the client log or brick logs?
Avati On Tue, Jun 19, 2012 at 11:22 AM, harry mangalam hjmanga...@gmail.com wrote: (Apologies if this already posted, but I recently had to change smtp servers which scrambled some list permissions, and I haven't seen it post) I set up a 3.3 gluster volume for another sysadmin and he has added it to his cluster via automount. It seems to work initially but after some time (days) he is now regularly seeing this warning: Too many levels of symbolic links when he tries to traverse the mounted filesystems. $ df: `/share
[Gluster-users] remove-brick redux; repeat as necessary
I had the opportunity (actually, desperate requirement) to try this again on a newly live system, and while it worked again, it required 2 remove-brick statements to actually get the volume to drop the brick. The first apparently did the data moving; the second was necessary to tell the volume to drop the brick. Since I had 2 bricks to drop, and re-add, it was repeatable. Not really serious enough for a bug report, but .. interesting .. to experience. -- Harry Mangalam, Research Computing, OIT, Rm 225 MSTB, UC Irvine [mailcode 2225] Google Voice Multiplexer: (949) 478-4487 415 South Circle View Dr, Irvine, CA, 92697 [shipping] Lat/Long: (33.642025,-117.844414) [paste into Google Maps] -- ___ Gluster-users mailing list Gluster-users@gluster.org http://gluster.org/cgi-bin/mailman/listinfo/gluster-users
[Gluster-users] too many redirects at the gluster download page
It may be just me/chrome, but trying to download the latest gluster by clicking on the Download button next to the Ant leads not to a download page but to the info page, which invites you to go back to the gluster.org page from which you just came. And when you click on the alternative 'Download' links (the button on the upper right or the larger Download GlusterFS icon with the package image), you get this in Chrome:

This webpage has a redirect loop
The webpage at http://www.gluster.org/download/ has resulted in too many redirects. Clearing your cookies for this site or allowing third-party cookies may fix the problem. If not, it is possibly a server configuration issue and not a problem with your computer.

Bug or feature? -- Harry Mangalam, Research Computing, OIT, Rm 225 MSTB, UC Irvine [mailcode 2225] Google Voice Multiplexer: (949) 478-4487 415 South Circle View Dr, Irvine, CA, 92697 [shipping] Lat/Long: (33.642025,-117.844414) [paste into Google Maps] -- ___ Gluster-users mailing list Gluster-users@gluster.org http://gluster.org/cgi-bin/mailman/listinfo/gluster-users
[Gluster-users] unable to delete 'empty' dirs with 3.3qa42
Just before upgrading to 3.3final, we had an rsync collision on our gluster filesystem which left us with undeletable dirs. The transport is IPoIB over 4 bricks as shown below.

$ gluster volume info
Volume Name: gli
Type: Distribute
Volume ID: 76cc5e88-0ac4-42ac-a4a3-31bf2ba611d4
Status: Started
Number of Bricks: 4
Transport-type: tcp,rdma
Bricks:
Brick1: pbs1ib:/bducgl
Brick2: pbs2ib:/bducgl
Brick3: pbs3ib:/bducgl
Brick4: pbs4ib:/bducgl
Options Reconfigured:
performance.io-cache: on
performance.quick-read: on
performance.io-thread-count: 64

The last 100 lines of the gluster log are here: http://pastie.org/4036243

The problem started when I was running an rsync with --delete, pruning the target of older cruft while moving a large /home onto the gluster volume. A user directed a job to start reading from a directory that was simultaneously being pruned, and all hell broke loose. Now we're left with a number of dirs and files that can't be deleted. 'du' says there are files in an apparently empty dir:

root@bduc-login:/gl/tvanerp/adni
561 $ du -sh *
40K dicom

altho 'ls' can't see them from the client:

root@bduc-login:/gl/tvanerp/adni
562 $ ls -lat dicom
total 40
drwxrwxr-x 3 tvanerp tvanerp 94214 Jun  5 23:40 ./
drwxrwxr-x 3 tvanerp tvanerp    72 Apr 23 13:11 ../

From the bricks you can see that there are still files there:

Tue Jun 05 23:18:02 [3.06 2.42 1.65][457.48/606] root@bduc-login:~
556 $ ssh pbs2 'ls -lat /bducgl/tvanerp/adni/dicom'
total 24
drwxrwxr-x 3 7335 7335 28672 2012-06-05 22:59 .
drwxr-xr-x 2 7335 7335  8192 2012-06-05 14:50 128_S_0947_20080422_MPRAGE
drwxrwxr-x 3 7335 7335    18 2012-04-23 13:11 ..

Tue Jun 05 23:18:26 [2.82 2.42 1.67][457.62/606] root@bduc-login:~
557 $ ssh pbs3 'ls -lat /bducgl/tvanerp/adni/dicom'
total 28
drwxrwxr-x 3 7335 7335 28672 2012-06-05 23:00 .
drwxr-xr-x 2 7335 7335 12288 2012-06-05 14:52 128_S_0947_20080422_MPRAGE
drwxrwxr-x 3 7335 7335    18 2012-04-23 13:11 ..

Tue Jun 05 23:18:51 [2.73 2.43 1.69][457.7/606] root@bduc-login:~
558 $ ssh pbs4 'ls -lat /bducgl/tvanerp/adni/dicom'
total 32
drwxrwxr-x 3 7335 7335 36864 2012-06-05 22:59 .
drwxr-xr-x 2 7335 7335 12288 2012-06-05 14:50 128_S_0947_20080422_MPRAGE
---------T 2 root root     0 2012-05-20 18:35 adni_pib_subjects.txt
drwxrwxr-x 3 7335 7335    18 2012-04-23 13:11 ..

but the client is not able to see/delete them, even as root. Suggestions? harry -- Harry Mangalam, Research Computing, OIT, Rm 225 MSTB, UC Irvine [mailcode 2225] Google Voice Multiplexer: (949) 478-4487 415 South Circle View Dr, Irvine, CA, 92697 [shipping] Lat/Long: (33.642025,-117.844414) [paste into Google Maps] -- ___ Gluster-users mailing list Gluster-users@gluster.org http://gluster.org/cgi-bin/mailman/listinfo/gluster-users
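One possible cleanup, strictly a sketch and not a supported procedure: the zero-length, mode ---------T entries visible on the bricks are DHT link files, and if they (and the leftover subdirs) are confirmed orphans - i.e. nothing in the volume still points at them - the usual last resort is to remove them directly on the bricks and then re-stat the parent from a client. Inspect before deleting anything:

  # list zero-length sticky-bit files (link-file candidates) on one brick
  ssh pbs4 'find /bducgl/tvanerp/adni/dicom -maxdepth 1 -type f -size 0 -perm -1000 -ls'
  # only after confirming they are stale:
  ssh pbs4 'find /bducgl/tvanerp/adni/dicom -maxdepth 1 -type f -size 0 -perm -1000 -delete'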
Re: [Gluster-users] A very special announcement from Gluster.org
WooHooo! Thanks, Glusteristas! Timing is quite fortuitous for us. hjm

On 05/31/2012 09:33 AM, John Mark Walker wrote: Today, we're announcing the next generation of GlusterFS (http://www.gluster.org/), version 3.3. The release has been a year in the making and marks several firsts: the first post-acquisition release under Red Hat, our first major act as an openly-governed project (http://www.gluster.org/roadmaps/) and our first foray beyond NAS. We've also taken our first steps towards merging big data and unstructured data storage, giving users and developers new ways of managing their data scalability challenges.

GlusterFS is an open source, fully distributed storage solution for the world's ever-increasing volume of unstructured data. It is a software-only, highly available, scale-out, centrally managed storage pool that can be backed by POSIX filesystems that support extended attributes, such as Ext3/4, XFS, BTRFS and many more.

This release provides many of the most commonly requested features including proactive self-healing, quorum enforcement, and granular locking for self-healing, as well as many additional bug fixes and enhancements. Some of the more noteworthy features include:

* Unified File and Object storage -- Blending OpenStack's Object Storage API (http://openstack.org/projects/storage/) with GlusterFS provides simultaneous read and write access to data as files or as objects.
* HDFS compatibility -- Gives Hadoop administrators the ability to run MapReduce jobs on unstructured data on GlusterFS and access the data with well-known tools and shell scripts.
* Proactive self-healing -- GlusterFS volumes will now automatically restore file integrity after a replica recovers from failure.
* Granular locking -- Allows large files to be accessed even during self-healing, a feature that is particularly important for VM images.
* Replication improvements -- With quorum enforcement you can be confident that your data has been written in at least the configured number of places before the file operation returns, allowing a user-configurable adjustment to fault tolerance vs performance.

Visit http://www.gluster.org/ to download. Packages are available for most distributions, including Fedora, Debian, RHEL, Ubuntu and CentOS.

Get involved! Join us on #gluster on freenode, join our mailing list (http://www.gluster.org/interact/mailinglists/), 'like' our Facebook page (http://facebook.com/GlusterInc), follow us on Twitter (http://twitter.com/glusterorg), or check out our LinkedIn group (http://www.linkedin.com/groups?gid=99784).

GlusterFS is an open source project sponsored by Red Hat (http://www.redhat.com/), who uses it in its line of Red Hat Storage (http://www.redhat.com/storage/) products.

(this post published at http://www.gluster.org/2012/05/introducing-glusterfs-3-3/) ___ Gluster-users mailing list Gluster-users@gluster.org http://gluster.org/cgi-bin/mailman/listinfo/gluster-users -- Harry Mangalam, Research Computing, OIT, Rm 225 MSTB, UC Irvine [mailcode 2225] Google Voice Multiplexer: (949) 478-4487 415 South Circle View Dr, Irvine, CA, 92697 [shipping] Lat/Long: (33.642025,-117.844414) [paste into Google Maps] -- ___ Gluster-users mailing list Gluster-users@gluster.org http://gluster.org/cgi-bin/mailman/listinfo/gluster-users
Re: [Gluster-users] 'remove-brick' is removing more bytes than are in the brick(?)
Installed the qa42 version on servers and clients, and under load it worked as advertised (tho of course more slowly than I would have liked :)) - removed ~1TB in just under 24 hr (on a DDR-IB connected 4-node set), ~40MB/s overall, tho there were a huge number of tiny files. The remove-brick cleared the brick (~1TB), tho with an initial set of 120 failures (what does this mean?):

Node    Rebalanced-files  size          scanned  failures  status       timestamp
pbs2ib  15676             69728541188   365886   120       in progress  May 22 17:33:14
pbs2ib  24844             134323243354  449667   120       in progress  May 22 18:08:56
pbs2ib  37937             166673066147  714175   120       in progress  May 22 19:08:21
pbs2ib  42014             173145657374  806556   120       in progress  May 22 19:33:21
pbs2ib  418842            222883965887  5729324  120       in progress  May 23 07:15:19
pbs2ib  419148            222907742889  5730903  120       in progress  May 23 07:16:26
pbs2ib  507375            266212060954  6192573  120       in progress  May 23 09:48:05
pbs2ib  540201            312712114570  6325234  120       in progress  May 23 11:15:51
pbs2ib  630332            416533679754  6633562  120       in progress  May 23 14:24:16
pbs2ib  644156            416745820627  6681746  120       in progress  May 23 14:45:44
pbs2ib  732989            432162450646  7024331  120       completed    May 23 17:26:20

(sorry for any wrapping)

and finally deleted the files:

root@pbs2:~
404 $ df -h
Filesystem  Size  Used   Avail  Use%  Mounted on
/dev/md0    8.2T  1010G  7.2T   13%   /bducgl   <- retained brick
/dev/sda    1.9T  384M   1.9T   1%    /bducgl1  <- removed brick

altho it left the directory skeleton (is this a bug or a feature?):

root@pbs2:/bducgl1
406 $ ls
aajames aelsadek amentes avuong1 btatevos chiaoyic dbecerra aamelire aganesan anasr awaring bvillac clarkap dbkeator aaskariz agkentanhml balakire calvinjs cmarcum dcs abanaiya agold argardne bgajare casem cmarkega dcuccia aboessen ahnsh arup biggsjcbatmall courtnem detwiler abondar aihlerasidhwa binz cesar crex dgorur abraatz aisenber asuncion bjanakal cestark cschendhealion abriscoe akathaatenner blind cfalvoculverr dkyu abuschalai2 atfrank blutescgalasso daliz dmsmith acohanalamngathinabmmiller cgarner danieldmvuong acstern allisons athsu bmobashe chadwicr dariusa dphillip ademirta almquist aveidlab brentmchangd1 dasherdshanthi etc.

And once completed with the 'commit' command, it no longer reports the brick as part of the volume:

$ gluster volume info gli
Volume Name: gli
Type: Distribute
Volume ID: 76cc5e88-0ac4-42ac-a4a3-31bf2ba611d4
Status: Started
Number of Bricks: 4
Transport-type: tcp,rdma
Bricks:
Brick1: pbs1ib:/bducgl
Brick2: pbs2ib:/bducgl   <- pbs2ib:/bducgl1 is no longer listed
Brick3: pbs3ib:/bducgl
Brick4: pbs4ib:/bducgl
Options Reconfigured:
performance.io-cache: on
performance.quick-read: on
performance.io-thread-count: 64

And no longer reports the removed brick as part of the gluster volume:

$ gluster volume status
Status of volume: gli
Gluster process        Port   Online  Pid
---
Brick pbs1ib:/bducgl   24016  Y  10770
Brick pbs2ib:/bducgl   24025  Y  1788
Brick pbs3ib:/bducgl   24018  Y  20953
Brick pbs4ib:/bducgl   24009  Y  20948

So this was a big improvement over the previous trial; the only glitches were the 120 failures (which mean...?) and the directory skeleton left on the removed brick, which may be a feature..? So it seems to have been fixed in qa42. thanks! hjm

On Tuesday 22 May 2012 00:02:02 Amar Tumballi wrote:
pbs2ib 8780091379699182236 2994733 in progress
Hi Harry, Can you please test once again with 'glusterfs-3.3.0qa42' and confirm the behavior? This seems like a bug (suspect it to be some overflow type of bug, not sure yet). Please help us with opening a bug report; meantime, we will investigate on this issue.
Regards, Amar -- Harry Mangalam - Research Computing, OIT, Rm 225 MSTB, UC Irvine [ZOT 2225] / 92697 Google Voice Multiplexer: (949) 478-4487 415 South Circle View Dr, Irvine, CA, 92697 [shipping] MSTB Lat/Long: (33.642025,-117.844414) (paste into Google Maps) -- ___ Gluster-users mailing list Gluster-users@gluster.org http://gluster.org/cgi-bin/mailman/listinfo/gluster-users
Re: [Gluster-users] Performance issues with striped volume over Infiniband
try 'ifstat' to see traffic on all interfaces simultaneously:

501 $ ifstat
       eth2                 ib1
 KB/s in  KB/s out   KB/s in  KB/s out
    0.80      0.70      0.00      0.00
    0.19      0.15      0.00      0.00
    0.07      0.15      0.00      0.00

The ifstat package is in debian/ubuntu. hjm

On Saturday 21 April 2012 02:45:10 Ionescu, A. wrote: Michael, Thanks for your suggestion. I had the same intuition as you, but then I used iptraf and saw no eth0 traffic associated with I/O on the Gluster volume (the tool doesn't show the ib0 interface, unfortunately). node01 and node02 are entered into /etc/hosts; they resolve to the ipoib addresses and are pingable. I will try increasing the number of threads and applying the patch Bryan suggested. Thanks, Adrian -- Harry Mangalam - Research Computing, OIT, Rm 225 MSTB, UC Irvine [ZOT 2225] / 92697 Google Voice Multiplexer: (949) 478-4487 415 South Circle View Dr, Irvine, CA, 92697 [shipping] MSTB Lat/Long: (33.642025,-117.844414) (paste into Google Maps) -- ___ Gluster-users mailing list Gluster-users@gluster.org http://gluster.org/cgi-bin/mailman/listinfo/gluster-users
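If ifstat isn't available, the kernel's own interface counters cover ib0 as well (sketch; ib0 as in Adrian's description):

  # sample bytes in/out on ib0 once a second
  watch -n1 'cat /sys/class/net/ib0/statistics/rx_bytes /sys/class/net/ib0/statistics/tx_bytes'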
[Gluster-users] IPoIB Volume (3.3b3) started but not online, not mountable
I was successfully running a IPoIB gluster testbed (V3.3b3 on Ubuntu 10.04.04) and brought it down smoothly to adjust some parameters. It now looks like this (the options reconfigured were just added):

# gluster volume info
Volume Name: gli
Type: Distribute
Volume ID: 76cc5e88-0ac4-42ac-a4a3-31bf2ba611d4
Status: Started
Number of Bricks: 5
Transport-type: tcp,rdma
Bricks:
Brick1: pbs1ib:/bducgl
Brick2: pbs2ib:/bducgl
Brick3: pbs2ib:/bducgl1
Brick4: pbs3ib:/bducgl
Brick5: pbs4ib:/bducgl
Options Reconfigured:
performance.io-cache: on
performance.quick-read: on
performance.io-thread-count: 64
auth.allow: 10.255.77.*, 128.200.15.*, 10.255.78.*, 10.255.89.*

however, a status query gives this:

# gluster volume status
Status of volume: gli
Gluster process           Port   Online  Pid
--
Brick pbs1ib:/bducgl      24016  N  N/A
Brick pbs2ib:/bducgl      24023  N  N/A
Brick pbs2ib:/bducgl1     24025  N  N/A
Brick pbs3ib:/bducgl      24016  N  N/A
Brick pbs4ib:/bducgl      24016  N  N/A
NFS Server on localhost   38467  N  N/A
NFS Server on pbs4ib      38467  N  N/A
NFS Server on pbs3ib      38467  N  N/A
NFS Server on pbs2ib      38467  N  N/A

(I didn't want the NFS Server options - is that a default to start it?) But the operative bit is that it's not online, despite being started. What could give this situation? As might be expected, clients can't mount the gluster vol. The last part of etc-glusterfs-glusterd.vol.log is many lines like this:

[2012-04-18 11:36:57.456318] E [socket.c:2115:socket_connect] 0-management: connection attempt failed (Connection refused)

and the last lines before are a number of stanzas like this:

[2012-04-18 11:31:14.698184] I [glusterd-op-sm.c::glusterd_op_ac_send_commit_op] 0-management: Sent op req to 3 peers
[2012-04-18 11:31:14.698379] I [glusterd-rpc-ops.c:1294:glusterd3_1_commit_op_cbk] 0-glusterd: Received ACC from uuid: 2a593581-bf45-446c-8f7c-212c53297803
[2012-04-18 11:31:14.698496] I [glusterd-rpc-ops.c:1294:glusterd3_1_commit_op_cbk] 0-glusterd: Received ACC from uuid: c79c4084-d6b9-4af9-b975-40dd6aa99b42
[2012-04-18 11:31:14.698581] I [glusterd-rpc-ops.c:1294:glusterd3_1_commit_op_cbk] 0-glusterd: Received ACC from uuid: 26de63bd-c5b7-48ba-b81d-5d77a533d077
[2012-04-18 11:31:14.698834] I [glusterd-rpc-ops.c:606:glusterd3_1_cluster_unlock_cbk] 0-glusterd: Received ACC from uuid: 2a593581-bf45-446c-8f7c-212c53297803
[2012-04-18 11:31:14.698879] I [glusterd-rpc-ops.c:606:glusterd3_1_cluster_unlock_cbk] 0-glusterd: Received ACC from uuid: 26de63bd-c5b7-48ba-b81d-5d77a533d077
[2012-04-18 11:31:14.698910] I [glusterd-rpc-ops.c:606:glusterd3_1_cluster_unlock_cbk] 0-glusterd: Received ACC from uuid: c79c4084-d6b9-4af9-b975-40dd6aa99b42
[2012-04-18 11:31:14.698929] I [glusterd-op-sm.c:2491:glusterd_op_txn_complete] 0-glusterd: Cleared local lock
[2012-04-18 11:31:15.410106] E [socket.c:2115:socket_connect] 0-management: connection attempt failed (Connection refused)

-- Harry Mangalam - Research Computing, OIT, Rm 225 MSTB, UC Irvine [ZOT 2225] / 92697 Google Voice Multiplexer: (949) 478-4487 415 South Circle View Dr, Irvine, CA, 92697 [shipping] MSTB Lat/Long: (33.642025,-117.844414) (paste into Google Maps) -- ___ Gluster-users mailing list Gluster-users@gluster.org http://gluster.org/cgi-bin/mailman/listinfo/gluster-users
Re: [Gluster-users] IPoIB Volume (3.3b3) started but not online, not mountable
With JoeJulian's help, tracked this down to what looks like a bug in the IP# format which causes glusterfsd to crash. The bug is: https://bugzilla.redhat.com:443/show_bug.cgi?id=813937 If anyone has an immediate workaround or correction, be glad to hear of it. hjm -- Harry Mangalam - Research Computing, OIT, Rm 225 MSTB, UC Irvine [ZOT 2225] / 92697 Google Voice Multiplexer: (949) 478-4487 415 South Circle View Dr, Irvine, CA, 92697 [shipping] MSTB Lat/Long: (33.642025,-117.844414) (paste into Google Maps) -- ___ Gluster-users mailing list Gluster-users@gluster.org http://gluster.org/cgi-bin/mailman/listinfo/gluster-users
Re: [Gluster-users] IPoIB Volume (3.3b3) started but not online, not mountable
Interim fix is to use ONLY commas - no spaces allowed (this used to be OK previously):

gluster volume set gli auth.allow \
'10.255.77.*,128.200.15.*,10.255.78.*,10.255.89.*'

is OK (glusterfsd starts correctly), but

gluster volume set gli auth.allow '10.255.77.*, 128.200.15.*, 10.255.78.*, 10.255.89.*'

is NOT OK (glusterfsd will not start). hjm

On Wednesday 18 April 2012 12:56:08 Harry Mangalam wrote: With JoeJulian's help, tracked this down to what looks like a bug in the IP# format which causes glusterfsd to crash. The bug is: https://bugzilla.redhat.com:443/show_bug.cgi?id=813937 If anyone has an immediate workaround or correction, be glad to hear of it. hjm -- Harry Mangalam - Research Computing, OIT, Rm 225 MSTB, UC Irvine [ZOT 2225] / 92697 Google Voice Multiplexer: (949) 478-4487 415 South Circle View Dr, Irvine, CA, 92697 [shipping] MSTB Lat/Long: (33.642025,-117.844414) (paste into Google Maps) -- ___ Gluster-users mailing list Gluster-users@gluster.org http://gluster.org/cgi-bin/mailman/listinfo/gluster-users
Re: [Gluster-users] IPoIB Volume (3.3b3) started but not online, not mountable
And one more observation that will probably be obvious in retrospect. If you enable auth.allow (on 3.3b3), it will do reverse lookups to verify hostnames so it will be more complicated to share an IPoIB gluster volume to IPoEth clients. I had been overriding DNS entries with /etc/hosts entries, but the auth.allow option will prevent that hack. If anyone knows how to share an IPoIB volume to ethernet clients in a more formally correct way, I'd be happy to learn of it. -- Harry Mangalam - Research Computing, OIT, Rm 225 MSTB, UC Irvine [ZOT 2225] / 92697 Google Voice Multiplexer: (949) 478-4487 415 South Circle View Dr, Irvine, CA, 92697 [shipping] MSTB Lat/Long: (33.642025,-117.844414) (paste into Google Maps) -- ___ Gluster-users mailing list Gluster-users@gluster.org http://gluster.org/cgi-bin/mailman/listinfo/gluster-users
Re: [Gluster-users] Best practice: Scaling down / Removing Servers
As I read the new 3.3b3 FAQ, yes. The remove-brick feature can do this (altho I have not yet tried it - maybe later this next week). http://community.gluster.org/q/what-s-new-in-glusterfs-3-3/ hjm On Saturday 14 April 2012 02:36:41 Philip wrote: We have some issues with the RAID-Controller of our gluster servers. We decided to buy complete new servers to replace the others but it is quite unclear how to perform this task without downtime. It is possible to add the new servers to the volume, rebalance it and then get all the data off the old servers and remove them after this? -- Harry Mangalam - Research Computing, OIT, Rm 225 MSTB, UC Irvine [ZOT 2225] / 92697 Google Voice Multiplexer: (949) 478-4487 415 South Circle View Dr, Irvine, CA, 92697 [shipping] MSTB Lat/Long: (33.642025,-117.844414) (paste into Google Maps) -- ___ Gluster-users mailing list Gluster-users@gluster.org http://gluster.org/cgi-bin/mailman/listinfo/gluster-users
Re: [Gluster-users] Gluster/RDMA
thanks for the advice. Those modules were not autoloaded, but could be loaded manually. I now have this profile:

$ lsmod | egrep '(ib|rdma)'
ib_umad        12363  0
ib_ucm         12496  0
rdma_ucm       11779  0
ib_uverbs      31902  2 ib_ucm,rdma_ucm
rdma_cm        31030  1 rdma_ucm
ib_cm          40316  2 ib_ucm,rdma_cm
iw_cm           9385  1 rdma_cm
ib_sa          22006  2 rdma_cm,ib_cm
ib_addr         6766  1 rdma_cm
zlib_deflate   21834  1 btrfs
libcrc32c       1244  1 btrfs
ib_mthca      140934  0
ib_mad         41321  4 ib_umad,ib_cm,ib_sa,ib_mthca
ib_core        64935 10 ib_umad,ib_ucm,rdma_ucm,ib_uverbs,rdma_cm,ib_cm,iw_cm,ib_sa,ib_mthca,ib_mad

and semi-magically, on one of the nodes, ibv_devinfo reports:

root@pbs3:~ 661 $ ibv_devinfo
hca_id: mthca0
    transport:        InfiniBand (0)
    fw_ver:           1.0.800
    node_guid:        0006:6a00:9800:6e55
    sys_image_guid:   0006:6a00:9800:6e55
    vendor_id:        0x066a
    vendor_part_id:   25204
    hw_ver:           0xA0
    board_id:         MT_023002
    phys_port_cnt:    1
    port: 1
        state:        PORT_INIT (2)
        max_mtu:      2048 (4)
        active_mtu:   512 (2)
        sm_lid:       0
        port_lid:     0
        port_lmc:     0x00

However, 2 other nodes are still mute - there's obviously some kind of software drift between them, but this gives me some purchase to figure other things out. Still haven't upgraded the firmware, but this allows me to extract the PSID and get the right FW image. Thanks! Harry

On Monday 07 November 2011 06:44:06 Ben England wrote: To Harry Mangalam about Gluster/RDMA: make sure these modules are loaded:
# modprobe -v rdma_ucm
# modprobe -v ib_uverbs
# modprobe -v ib_ucm
To run the subnet manager:
# modprobe -v ib_umad
Make sure libibverbs and (libmlx4 or libmthca) RPMs are installed. I don't understand why the appropriate modules aren't loaded automatically. Could put something in /etc/modprobe.d/ to make this happen maybe? Infiniband should not require troubleshooting after 5-10 years of development, it should just work. ___ Gluster-users mailing list Gluster-users@gluster.org http://gluster.org/cgi-bin/mailman/listinfo/gluster-users -- Harry Mangalam - Research Computing, OIT, Rm 225 MSTB, UC Irvine [ZOT 2225] / 92697 Google Voice Multiplexer: (949) 478-4487 MSTB Lat/Long: (33.642025,-117.844414) (paste into Google Maps) -- This signature has been OCCUPIED! ___ Gluster-users mailing list Gluster-users@gluster.org http://gluster.org/cgi-bin/mailman/listinfo/gluster-users
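To make those loads persist across reboots, one convention (sketch; /etc/modules is the Debian/Ubuntu mechanism, as on the pbs* nodes elsewhere in these threads - RHEL-family boxes would instead use a script under /etc/sysconfig/modules/):

  # append the RDMA/IB modules, one per line, to be loaded at boot
  for m in rdma_ucm ib_uverbs ib_ucm ib_umad; do echo "$m" >> /etc/modules; done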
Re: [Gluster-users] forcing a brick to unload storage to replace disks?
How is this functionality invoked? I've now upgraded to 3.3b2, which I believe is the latest version available. hjm On Friday 09 December 2011 00:23:12 Amar Tumballi wrote: Is there a process whereby I can clear a brick by forcing the files to migrate to the other bricks? Hi Harry, This feature got committed to master branch (upstream) recently, with which a remove-brick will take care of migrating data out of the brick. This feature is not part of any of current 3.2.x (or earlier) releases. If you are in testing/validating phase, 3.3.0qa15 should have this feature for you. Regards, Amar ___ Gluster-users mailing list Gluster-users@gluster.org http://gluster.org/cgi-bin/mailman/listinfo/gluster-users -- Harry Mangalam - Research Computing, OIT, Rm 225 MSTB, UC Irvine [ZOT 2225] / 92697 Google Voice Multiplexer: (949) 478-4487 MSTB Lat/Long: (33.642025,-117.844414) (paste into Google Maps) -- Citzens United: Democracy on meth - Walter Egan ___ Gluster-users mailing list Gluster-users@gluster.org http://gluster.org/cgi-bin/mailman/listinfo/gluster-users
Re: [Gluster-users] Problem: Gluster performance too high..
Thanks! bonnie++ was going to be my next step. I thought of doing the local create-local-files-and-copy, but it complicates the process a bit (he said, lazily sipping his mint julep). But the time to read a local copy would be fairly trivial compared to the network time, so this sounds like it would be a good follow-up. On Thursday 29 March 2012 12:40:23 Jeff White wrote: Maybe it's cheating by writing sparse files or something of the like because it knows it's all zeros? Create some files locally from /dev/urandom and copy them. I think you'll see much lower performance. Better yet, use bonnie++. Jeff White - Linux/Unix Systems Engineer University of Pittsburgh - CSSD -- Harry Mangalam - Research Computing, OIT, Rm 225 MSTB, UC Irvine [ZOT 2225] / 92697 Google Voice Multiplexer: (949) 478-4487 415 South Circle View Dr, Irvine, CA, 92697 [shipping] MSTB Lat/Long: (33.642025,-117.844414) (paste into Google Maps) -- ___ Gluster-users mailing list Gluster-users@gluster.org http://gluster.org/cgi-bin/mailman/listinfo/gluster-users
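Something like this makes that follow-up concrete (sketch; /gl is the client mountpoint used elsewhere in these threads):

  # 1GB of incompressible data, so sparse/zero-detection can't flatter the numbers
  dd if=/dev/urandom of=/tmp/rand.1G bs=1M count=1024
  time cp /tmp/rand.1G /gl/rand.1G        # network write path
  time cp /gl/rand.1G /tmp/rand.1G.back   # network read path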
Re: [Gluster-users] Problem: Gluster performance too high..
On Thursday 29 March 2012 12:50:48 Harry Mangalam wrote: My assumption was that at some point there would be a 1Gb bottleneck, but if the packets were switched all the way thru, there would be a theoretical max related to the number of gluster-server Gb ports (4). So until I approach 4Gb/s, I guess I wouldn't necessarily see this bottleneck. Is that correct? Adding a little bit more: Digging thru my raw numbers, the fastest completion of the script yields an effective thruput of ~478MB/s, a bit less than the theoretical max of 500MB/s (if my above mumblings were correct), so there may be something to this. -- Harry Mangalam - Research Computing, OIT, Rm 225 MSTB, UC Irvine [ZOT 2225] / 92697 Google Voice Multiplexer: (949) 478-4487 415 South Circle View Dr, Irvine, CA, 92697 [shipping] MSTB Lat/Long: (33.642025,-117.844414) (paste into Google Maps) -- ___ Gluster-users mailing list Gluster-users@gluster.org http://gluster.org/cgi-bin/mailman/listinfo/gluster-users
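Spelling out the arithmetic: 4 server ports x 1 Gb/s = 4 Gb/s aggregate, and 4 Gb/s / 8 bits per byte = 500 MB/s. The observed 478 MB/s is about 96% of that ceiling - roughly as close as real traffic gets - which supports the guess that the servers' aggregate GbE bandwidth is the bottleneck.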
[Gluster-users] gluster 3.31b1 destroying mountpoints?
Testing gluster 3.3b2 on Ubuntu 10.04.4. 3.3b1 seemed to be fine in this regard, but I'm getting a very peculiar effect when trying to mount gluster with 3.3b2:

$ mount -t glusterfs pbs1:/gl /gll
Mount failed. Please check the log file for more details.

$ ls -l /
...
drwxr-xr-x 193 root root 12288 2012-03-22 13:19 etc
d?????????   ? ?    ?        ?                ? gl
d?????????   ? ?    ?        ?                ? gll
drwxr-xr-x 403 root root 12288 2012-03-17 08:26 home
...

The mount seems to be destroying the mountpoints. Both /gl and /gll were created as mountpoints, and then destroyed by trying to mount the gluster volume on them. I did not use a transport option since it's supposed to default to socket. The package used on both servers and clients was the Debian pkg glusterfs_3.3beta2-1_amd64_with_rdma.deb.

The volume was created with:

gluster volume create gl \
  transport tcp,rdma \
  pbs1:/bducgl \
  pbs2:/bducgl pbs2:/bducgl1 \
  pbs3:/bducgl \
  pbs4:/bducgl
gluster volume set gl auth.allow 10.255.78.*,10.255.89.*,128.200.15.*
gluster volume set gl performance.io-thread-count 64

(I'm also experimenting with IB transport, but would like to test the TCP transport as well.) and the server-side status was:

root@pbs1:~# gluster volume info
Volume Name: gl
Type: Distribute
Status: Created
Number of Bricks: 5
Transport-type: tcp,rdma
Bricks:
Brick1: pbs1:/bducgl
Brick2: pbs2:/bducgl
Brick3: pbs2:/bducgl1
Brick4: pbs3:/bducgl
Brick5: pbs4:/bducgl
Options Reconfigured:
performance.io-thread-count: 64
auth.allow: 10.255.78.*,10.255.89.*,xxx.xxx.xx.*

on the client side, /var/log/glusterfs/gl.log says:

[2012-03-22 13:08:52.903423] W [fuse-bridge.c:2280:fuse_statfs_cbk] 0-glusterfs-fuse: 33: ERR = -1 (Transport endpoint is not connected)
[2012-03-22 13:08:53.681680] I [client.c:1885:client_rpc_notify] 0-gl-client-4: disconnected

On the server side, there are many lines of the format:

[2012-03-21 14:22:13.279799] W [rpc-transport.c:183:rpc_transport_load] 0-rpc-transport: missing 'option transport-type'. defaulting to socket
[2012-03-21 14:22:13.487349] I [cli-rpc-ops.c:1000:gf_cli3_1_set_volume_cbk] 0-cli: Received resp to set
[2012-03-21 14:22:13.487548] I [input.c:46:cli_batch] 0-: Exiting with: 0
[2012-03-21 14:23:23.869580] W [rpc-transport.c:183:rpc_transport_load] 0-rpc-transport: missing 'option transport-type'. defaulting to socket
[2012-03-21 14:23:24.74471] I [cli-rpc-ops.c:1000:gf_cli3_1_set_volume_cbk] 0-cli: Received resp to set
[2012-03-21 14:23:24.74668] I [input.c:46:cli_batch] 0-: Exiting with: 0
[2012-03-22 12:56:51.299987] W [rpc-transport.c:183:rpc_transport_load] 0-rpc-transport: missing 'option transport-type'. defaulting to socket
[2012-03-22 12:56:51.425471] I [cli-rpc-ops.c:413:gf_cli3_1_get_volume_cbk] 0-cli: Received resp to get vol: 0
[2012-03-22 12:56:51.425694] I [cli-rpc-ops.c:606:gf_cli3_1_get_volume_cbk] 0-: Returning: 0
[2012-03-22 12:56:51.462537] I [cli-rpc-ops.c:413:gf_cli3_1_get_volume_cbk] 0-cli: Received resp to get vol: 0
[2012-03-22 12:56:51.462649] I [cli-rpc-ops.c:606:gf_cli3_1_get_volume_cbk] 0-: Returning: 0
[2012-03-22 12:56:51.462663] I [input.c:46:cli_batch] 0-: Exiting with: 0

This can't be normal -- Harry Mangalam - Research Computing, OIT, Rm 225 MSTB, UC Irvine [ZOT 2225] / 92697 Google Voice Multiplexer: (949) 478-4487 415 South Circle View Dr, Irvine, CA, 92697 [shipping] MSTB Lat/Long: (33.642025,-117.844414) (paste into Google Maps) -- ___ Gluster-users mailing list Gluster-users@gluster.org http://gluster.org/cgi-bin/mailman/listinfo/gluster-users
Re: [Gluster-users] should be: gluster 3.31b*2* destroying mountpoints?
Resolution: As Brian noted, and Haris alluded to, these strange dirs are shown this way when a non-running glusterfs is mounted. When I started the gluster volume remotely, they immediately mounted, such that I had the same gluster volume mounted on both /gl and /gll (/gll was created so I could ignore the odd state of /gl for a while). I was able to umount the gluster vol from /gll and delete it normally. So now it seems to be working correctly. While this might be an edge case, I wonder if this might be detected and a relevant error emitted by the client (yes, easy for me to say..).

thanks,
hjm

On Thursday 22 March 2012 14:05:25 Brian Candler wrote:
On Thu, Mar 22, 2012 at 01:56:49PM -0700, Harry Mangalam wrote: Previous email had a typo in Subject line.
What do you mean by destroying the mountpoint? I have seen those d??? entries before (not with gluster). IIRC it's when a directory has 'read' but not 'execute' bits set:

$ mkdir foo
$ touch foo/bar
$ chmod 666 foo
$ ls -l foo
ls: cannot access foo/bar: Permission denied
total 0
-????????? ? ? ? ? ? bar

-- Harry Mangalam - Research Computing, OIT, Rm 225 MSTB, UC Irvine [ZOT 2225] / 92697 Google Voice Multiplexer: (949) 478-4487 415 South Circle View Dr, Irvine, CA, 92697 [shipping] MSTB Lat/Long: (33.642025,-117.844414) (paste into Google Maps) --
___ Gluster-users mailing list Gluster-users@gluster.org http://gluster.org/cgi-bin/mailman/listinfo/gluster-users
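P.S. For my own future reference, the sanity check I'll run before mounting from now on; a minimal sketch assuming my volume name 'gl' and server 'pbs1':

$ gluster volume info gl | grep Status    # should report 'Status: Started'
$ gluster volume start gl                 # only if it reports Created/Stopped
$ mount -t glusterfs pbs1:/gl /gl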
Re: [Gluster-users] Reducing a volume
Is this documented anywhere? I'm in a position where this would be very useful. I've posted a few queries about this as well, and if the function exists, it's an anonymous one. Last year, Amar responded that: "If you are in testing/validating phase, 3.3.0qa15 should have this feature for you." But there are no docs pointing to how to do it. If the development docs could be made available as a wiki or something similar, we stunt-testers could provide feedback (examples of gotchas and edge cases, and what works as expected) that might have value.

Best wishes, Harry

On Monday 13 February 2012 00:42:10 Brian Candler wrote:
On Mon, Feb 13, 2012 at 12:37:20AM +0100, Arnold Krille wrote: the third issue I encountered today: How do I tell gluster to remove two bricks of my six-brick-two-replica volume without losing data?
From what I've read (but I've not tried it yet), this will be a feature in 3.3: http://community.gluster.org/q/what-s-new-in-glusterfs-3-3/ "Remove-brick can migrate data to remaining bricks."
___ Gluster-users mailing list Gluster-users@gluster.org http://gluster.org/cgi-bin/mailman/listinfo/gluster-users

-- Harry Mangalam - Research Computing, OIT, Rm 225 MSTB, UC Irvine [ZOT 2225] / 92697 Google Voice Multiplexer: (949) 478-4487 415 South Circle View Dr, Irvine, CA, 92697 [shipping] MSTB Lat/Long: (33.642025,-117.844414) (paste into Google Maps) -- Advertising the device from which your email was sent may reflect poorly on your imagination, self-image, and/or technical ability.
___ Gluster-users mailing list Gluster-users@gluster.org http://gluster.org/cgi-bin/mailman/listinfo/gluster-users
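P.S. For anyone searching the archive later: my guess at what the 3.3 invocation will look like, extrapolated from the start/status/commit pattern the CLI already uses for replace-brick, and NOT verified against 3.3.0qa15 (volume and brick names below are placeholders):

$ gluster volume remove-brick myvol server5:/brick server6:/brick start
$ gluster volume remove-brick myvol server5:/brick server6:/brick status    # poll until the data drain completes
$ gluster volume remove-brick myvol server5:/brick server6:/brick commit    # then actually drop the bricks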
Re: [Gluster-users] Reducing a volume
On Monday 13 February 2012 10:12:13 John Mark Walker wrote:
- Original Message -
"But there's no docs to point to how to do it. If the development docs could be made available as a wiki or something similar, we stunt-testers could provide feedback, examples of gotchas and edge cases, and what works as expected that might have value."
Greetings - yes, you are quite correct. We are in the process of documenting new features in 3.3 and putting them on the wiki. Are you volunteering to assist us in this process? :)

Yes, as implied in the post - I'd be happy to.

Have you tried out any of the new QA builds to test?

I'm running 3.3b1 and will be upgrading to 3.3b2 later this week if there are no other disasters taking precedence. If there's a place to get more recent versions, I'll try those as well.

Best, Harry

-- Harry Mangalam - Research Computing, OIT, Rm 225 MSTB, UC Irvine [ZOT 2225] / 92697 Google Voice Multiplexer: (949) 478-4487 415 South Circle View Dr, Irvine, CA, 92697 [shipping] MSTB Lat/Long: (33.642025,-117.844414) (paste into Google Maps) -- Advertising the device from which your email was sent may reflect poorly on your imagination, self-image, and/or technical ability.
___ Gluster-users mailing list Gluster-users@gluster.org http://gluster.org/cgi-bin/mailman/listinfo/gluster-users
[Gluster-users] Preliminary 3.3 Admin/User docs?
There have been quite a lot of hints and suggestions about what 3.3 will provide. I'm running 3.3b1, but am having trouble figuring out just what features it has and how to exploit them. In particular, the ability to force a brick to unload is apparently available in 3.3.0qa15 (ref below), but how does one go about doing this, other than reading source code? Are there preliminary Admin docs or a wiki where this info is being assembled? I looked thru the docs on the gluster site, but there doesn't seem to be much in the way of docs such as the very useful, but now a bit dated, 'Gluster_FS_3.2_Admin_Guide.pdf'.

The Gluster Management Console is now in early release:
http://download.gluster.org/pub/gluster/glustermc/1.0/Documentation/User_Guide/html/index.html
and has some docs - is this what we're supposed to use to manipulate our glusterfs's?

best wishes
harry

On Friday 09 December 2011 00:23:12 Amar Tumballi wrote:
Is there a process whereby I can clear a brick by forcing the files to migrate to the other bricks?
Hi Harry, This feature got committed to the master branch (upstream) recently, with which a remove-brick will take care of migrating data out of the brick. This feature is not part of any of the current 3.2.x (or earlier) releases. If you are in the testing/validating phase, 3.3.0qa15 should have this feature for you.

-- Harry Mangalam - Research Computing, OIT, Rm 225 MSTB, UC Irvine [ZOT 2225] / 92697 Google Voice Multiplexer: (949) 478-4487 415 South Circle View Dr, Irvine, CA, 92697 [shipping] MSTB Lat/Long: (33.642025,-117.844414) (paste into Google Maps) -- Advertising the device from which your email was sent may reflect poorly on your imagination, self-image, and/or technical ability.
___ Gluster-users mailing list Gluster-users@gluster.org http://gluster.org/cgi-bin/mailman/listinfo/gluster-users
Re: [Gluster-users] what happens when adding a pre-populated brick?
Thanks very much for the answer, Jeff. Lacking further info, I'll hang out on the gluster irc and, over the break, try it out on an experimental volume.

hjm

On Friday 16 December 2011 07:43:06 Jeff White wrote:
I have been told this is do-able but I haven't tested it much myself. I would be interested in hearing about this from anyone who has done it. From what I heard on irc you can do the following:
1. Have existing data in server1:/data1
2. Stop all changes to server1:/data1
3. Create a volume: gluster volume create <volname> server1:/data1 server2:/data2
4. Mount the new volume via FUSE
5. Trigger a self-heal: find <gluster-mount> -noleaf -print0 | xargs --null stat >/dev/null
I also heard that there could be a GFID problem if the data was previously in another Gluster volume.
Jeff White - Linux/Unix Systems Engineer, University of Pittsburgh - CSSD

On 12/15/2011 01:48 PM, Harry Mangalam wrote:
The use case is that we have a multiTB data partition that we would like to glusterize. Could we add that store to a gluster volume and have it explicitly rebalance across the gluster volume? Or would the existing files/layout be ignored? This would be a big selling point in justifying gluster to owners of large existing data stores.
Harry Mangalam - Research Computing, OIT, Rm 225 MSTB, UC Irvine [ZOT 2225] / 92697 Google Voice Multiplexer: (949) 478-4487 MSTB Lat/Long: (33.642025,-117.844414) (paste into Google Maps) -- This signature has been OCCUPIED!

-- Harry Mangalam - Research Computing, OIT, Rm 225 MSTB, UC Irvine [ZOT 2225] / 92697 Google Voice Multiplexer: (949) 478-4487 MSTB Lat/Long: (33.642025,-117.844414) (paste into Google Maps) -- This signature has been OCCUPIED!
___ Gluster-users mailing list Gluster-users@gluster.org http://gluster.org/cgi-bin/mailman/listinfo/gluster-users
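P.S. For my own notes, Jeff's sequence written out end-to-end; a sketch only, with my own placeholder volume name ('premade') and mount path, untested on real data:

# on server1, with all writers to /data1 stopped
gluster volume create premade server1:/data1 server2:/data2
gluster volume start premade
# on a client
mount -t glusterfs server1:/premade /mnt/premade
# walk every file to trigger lookup/self-heal
find /mnt/premade -noleaf -print0 | xargs --null stat >/dev/null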
[Gluster-users] what happens when adding a pre-populated brick?
The use case is that we have a multiTB data partition that we would like to glusterize. Could we add that store to a gluster volume and have it explicitly rebalance across the gluster volume? Or would the existing files/layout be ignored? This would be a big selling point in justifying gluster to owners of large existing data stores.

-- Harry Mangalam - Research Computing, OIT, Rm 225 MSTB, UC Irvine [ZOT 2225] / 92697 Google Voice Multiplexer: (949) 478-4487 MSTB Lat/Long: (33.642025,-117.844414) (paste into Google Maps) -- This signature has been OCCUPIED!
___ Gluster-users mailing list Gluster-users@gluster.org http://gluster.org/cgi-bin/mailman/listinfo/gluster-users
Re: [Gluster-users] where's the API docs?
This still works for me:
http://download.gluster.com/pub/gluster/glusterfs/qa-releases/3.3-beta-2/

hjm

On Wednesday 14 December 2011 11:54:26 Redding, Erik wrote:
I'm working on a proposal for a 1pb Gluster rig and I wanted to be able to present programmers with the API docs, to sweeten the deal and to be able to programmatically manage it. I assumed there was a REST API for interfacing with the filesystem, either management or actual file i/o, because it appears in the feature list from time to time. I'm trying to dig around on how to pull the 3.3 beta, but with the Red Hat transition I'm finding all of those offerings have disappeared. I'm digging around and
Erik Redding, Systems Programmer, RHCE, Core Systems, Texas State University-San Marcos

On Dec 14, 2011, at 11:19 AM, John Mark Walker wrote:
- Original Message -
Ah - OK, thanks, Jeff. I was looking for the Swift and REST API docs. I assumed there were API interfaces in 3.2.5. I'll go dig up some roadmap info.
I guess the first question to ask is, what are you looking to do? -JM

-- Harry Mangalam - Research Computing, OIT, Rm 225 MSTB, UC Irvine [ZOT 2225] / 92697 Google Voice Multiplexer: (949) 478-4487 MSTB Lat/Long: (33.642025,-117.844414) (paste into Google Maps) -- This signature has been OCCUPIED!
___ Gluster-users mailing list Gluster-users@gluster.org http://gluster.org/cgi-bin/mailman/listinfo/gluster-users
[Gluster-users] mixing rdma/tcp bricks, rebalance operation locked up
I built an rdma-based volume out of 5 bricks:

$ gluster volume info
Volume Name: glrdma
Type: Distribute
Status: Started
Number of Bricks: 5
Transport-type: rdma
Bricks:
Brick1: pbs1:/data2
Brick2: pbs2:/data2
Brick3: pbs3:/data2
Brick4: pbs3:/data
Brick5: pbs4:/data
Options Reconfigured:
diagnostics.count-fop-hits: on
diagnostics.latency-measurement: on

and everything was working well. I then tried to add a TCP/socket brick to it, thinking that it would be refused, but gluster happily added it:

$ gluster volume info
Volume Name: glrdma
Type: Distribute
Status: Started
Number of Bricks: 6
Transport-type: rdma
Bricks:
Brick1: pbs1:/data2
Brick2: pbs2:/data2
Brick3: pbs3:/data2
Brick4: pbs3:/data
Brick5: pbs4:/data
Brick6: dabrick:/data2 -- the TCP/socket brick
Options Reconfigured:
diagnostics.count-fop-hits: on
diagnostics.latency-measurement: on

However, not too surprisingly, there are problems when I try to rebalance onto the added brick. It allowed me to start a rebalance/fix-layout, but it never ended, and the logs continue to contain the following reports of 'connection refused' (see at bottom). Attempts to remove the TCP brick are unsuccessful, even after stopping the volume:

$ gluster volume stop glrdma
Stopping volume will make its data inaccessible. Do you want to continue? (y/n) y
Stopping volume glrdma has been successful
$ gluster volume remove-brick glrdma dabrick:/data2
Removing brick(s) can result in data loss. Do you want to Continue? (y/n) y
Remove Brick unsuccessful

(with more errors citing "missing 'option transport-type'. defaulting to socket"):

[2011-12-13 10:34:57.241676] I [cli-rpc-ops.c:1073:gf_cli3_1_remove_brick_cbk] 0-cli: Received resp to remove brick
[2011-12-13 10:34:57.241852] I [input.c:46:cli_batch] 0-: Exiting with: -1
[2011-12-13 10:46:08.937294] W [rpc-transport.c:606:rpc_transport_load] 0-rpc-transport: missing 'option transport-type'. defaulting to socket
[2011-12-13 10:46:09.110636] I [cli-rpc-ops.c:417:gf_cli3_1_get_volume_cbk] 0-cli: Received resp to get vol: 0
[2011-12-13 10:46:09.110845] I [cli-rpc-ops.c:596:gf_cli3_1_get_volume_cbk] 0-: Returning: 0
[2011-12-13 10:46:09.111038] I [cli-rpc-ops.c:417:gf_cli3_1_get_volume_cbk] 0-cli: Received resp to get vol: 0
[2011-12-13 10:46:09.111070] I [cli-rpc-ops.c:596:gf_cli3_1_get_volume_cbk] 0-: Returning: 0
[2011-12-13 10:46:09.111080] I [input.c:46:cli_batch] 0-: Exiting with: 0
[2011-12-13 10:52:18.142283] W [rpc-transport.c:606:rpc_transport_load] 0-rpc-transport: missing 'option transport-type'. defaulting to socket

And the rebalance operations now seem to be locked up, since the responses to the rebalance commands are nonsensical (the commands were given serially, with no other intervening commands):

$ gluster volume rebalance glrdma fix-layout start
Rebalance on glrdma is already started
$ gluster volume rebalance glrdma fix-layout status
rebalance stopped
$ gluster volume rebalance glrdma fix-layout stop
stopped rebalance process of volume glrdma (after rebalancing 0 files totaling 0 bytes)
$ gluster volume rebalance glrdma fix-layout start
Rebalance on glrdma is already started

Is there a way to back out of this situation? Or has incorrectly adding the TCP brick permanently hosed the volume? And does this imply a bug in the add-brick routine? (hopefully fixed?)
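In case the answer turns out to be "start over", my untested plan B is to recreate the volume with both transports declared up front, using the transport syntax that volume create already accepts (and assuming I can first delete the volume and clean the bricks):

$ gluster volume create glrdma transport tcp,rdma \
    pbs1:/data2 pbs2:/data2 pbs3:/data2 pbs3:/data pbs4:/data dabrick:/data2
$ gluster volume start glrdma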
Log extracts -- tc-glusterd-mount-glrdma.log (and nfs.log, even tho I haven't tried to export it via nfs) has zillions of these lines:

[2011-12-13 10:36:11.702130] E [rdma.c:4417:tcp_connect_finish] 0-glrdma-client-5: tcp connect to failed (Connection refused)

cli.log has many of these lines:

[2011-12-13 10:34:55.142428] W [rpc-transport.c:606:rpc_transport_load] 0-rpc-transport: missing 'option transport-type'. defaulting to socket

-- Harry Mangalam - Research Computing, OIT, Rm 225 MSTB, UC Irvine [ZOT 2225] / 92697 Google Voice Multiplexer: (949) 478-4487 MSTB Lat/Long: (33.642025,-117.844414) (paste into Google Maps) -- This signature has been OCCUPIED!
___ Gluster-users mailing list Gluster-users@gluster.org http://gluster.org/cgi-bin/mailman/listinfo/gluster-users
[Gluster-users] forcing a brick to unload storage to replace disks?
Hi All,

More time for gluster; after much travel in the twisty dark tunnels of OFED, IB, card firmware upgrades, OS compatibility, etc, I now have a distributed rdma volume over 5 bricks (2 on one server) and it seems to be working well. I would now like to force-unload one brick to emulate a disk-upgrade process. Here's my vol info:

---
Thu Dec 08 11:44:05 [0.08 0.05 0.01] root@pbs3:~
522 $ gluster volume info
Volume Name: glrdma
Type: Distribute
Status: Started
Number of Bricks: 5
Transport-type: rdma
Bricks:
Brick1: pbs1:/data2
Brick2: pbs2:/data2
Brick3: pbs3:/data2
Brick4: pbs3:/data
Brick5: pbs4:/data
---

From the Admin doc, I can do a 'replace-brick' operation, but that seems to require an unused brick: when I try it with an already-incorporated brick, gluster complains that:

---
Thu Dec 08 11:52:12 [0.00 0.01 0.00] root@pbs3:~
524 $ gluster volume replace-brick glrdma pbs2:/data2 pbs4:/data start
Brick: pbs4:/data already in use
---

Is there a process whereby I can clear a brick by forcing the files to migrate to the other bricks?

-- Harry Mangalam - Research Computing, OIT, Rm 225 MSTB, UC Irvine [ZOT 2225] / 92697 Google Voice Multiplexer: (949) 478-4487 MSTB Lat/Long: (33.642025,-117.844414) (paste into Google Maps) -- This signature has been OCCUPIED!
___ Gluster-users mailing list Gluster-users@gluster.org http://gluster.org/cgi-bin/mailman/listinfo/gluster-users
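P.S. For completeness: replace-brick should work if I had a genuinely unused brick to swap in (pbs4:/data3 below is a hypothetical spare; the start/status/commit pattern is from the 3.2 Admin guide), but that's a one-for-one swap, not the shrink I'm after:

$ gluster volume replace-brick glrdma pbs2:/data2 pbs4:/data3 start
$ gluster volume replace-brick glrdma pbs2:/data2 pbs4:/data3 status    # poll until migration completes
$ gluster volume replace-brick glrdma pbs2:/data2 pbs4:/data3 commit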
Re: [Gluster-users] setting up a server
See Chapter 6 of the Admin guide: "6. Setting Up GlusterFS Server Volumes". This assumes you've dedicated servers to be gluster bricks. If not, that's obviously 1st. Then install the gluster software, then on to chapter 6.

hjm

On Monday 28 November 2011 13:07:07 Steven Jones wrote:
Hi, I have the install guide and the admin guide; nothing in either that I can see from the contents pages tells me how to create the first server (I assume that is what I have to do?). Is there another doc I'm missing? Or a good URL for a howto on a redhat-based machine? Also, from what I can see the free version is cli only? And there is no virtual (vmware) appliance?
regards, Steven Jones, Technical Specialist - Linux RHCE, Victoria University, Wellington, NZ, 0064 4 463 6272
___ Gluster-users mailing list Gluster-users@gluster.org http://gluster.org/cgi-bin/mailman/listinfo/gluster-users

-- Harry Mangalam - Research Computing, OIT, Rm 225 MSTB, UC Irvine [ZOT 2225] / 92697 Google Voice Multiplexer: (949) 478-4487 MSTB Lat/Long: (33.642025,-117.844414) (paste into Google Maps) -- This signature has been OCCUPIED!
___ Gluster-users mailing list Gluster-users@gluster.org http://gluster.org/cgi-bin/mailman/listinfo/gluster-users
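P.S. To make "on to chapter 6" concrete, the bare-bones first-volume sequence looks something like this; hostnames, brick paths, and the volume name are placeholders, and the service invocation may differ by distro:

# on each server, after installing the packages
service glusterd start
# from server1, join the second server to the pool
gluster peer probe server2
# create and start a simple distributed volume
gluster volume create testvol server1:/export/brick1 server2:/export/brick1
gluster volume start testvol
# on a client
mount -t glusterfs server1:/testvol /mnt/testvol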
Re: [Gluster-users] OT: possible to upgrade mellanox HBA firmware from ubuntu?
I can verify that the CentOS approach works as Dan described it, with the caveats prefixed by ##hjm, using a testbed machine running a fresh install of CentOS6. Thanks to Dan for such a nice outline.

hjm

On Tuesday 08 November 2011 12:11:44 Dan Cyr wrote:
Harry, I had these same problems when I upgraded my firmware. Here's what I ended up doing (CentOS6 or RHEL6, *not 6.1*):

If you use the default kernel, 2.6.32_71.el6 (don't run yum update), then you can use the kernel-mft RPM included with OFED (kernel-mft-2.7.0-2.6.32_71.el6.x86_64.x86_64.rpm):

yum -y install mstflint
wget http://www.mellanox.com/downloads/ofed/MLNX_OFED_LINUX-1.5.3-1.0.0-rhel6-x86_64.iso
mkdir -p /mnt/iso
mount -o loop MLNX_OFED_LINUX-1.5.3-1.0.0-rhel6-x86_64.iso /mnt/iso/
yum -y install /mnt/iso/RPMS/kernel-mft-2.7.0-2.6.32_71.el6.x86_64.x86_64.rpm --nogpgcheck
##hjm: you need to install the mst app from the mft rpm that is
##hjm: distributed in the above iso image:
##hjm: sudo rpm -i /mnt/iso/RPMS/mft-2.7.0-20.x86_64.rpm
mst start
ls /dev/mst/
# Verify the proper device is there (*_pciconf?) - if there are none, this needs to be resolved before continuing.
##hjm: this worked fine:
ls /dev/mst/
mt25204_pciconf0 mt25204_pci_cr0

mkdir -p /usr/src/firmware
cd /usr/src/firmware
# My cards are SuperMicro UIO - replace the below commands appropriately
#wget ftp://ftp.supermicro.com/Firmware/InfiniBand/AOC-UINF-m2/aocuinfm2_20090609.zip
#unzip aocuinfm2_20090609.zip
# -allow_psid_change might not be required - check with your hardware vendor. For me I found this link: http://64.174.237.178/support/faqs/faq.cfm?faq=9803
#flint -d /dev/mst/mt25418_pciconf0 -i aocuinfm2_20090609.bin -nofs -allow_psid_change burn
#reboot

===
##hjm: found the latest firmware upgrade here:
http://www.mellanox.com/content/pages.php?pg=custom_firmware_table
And once you get the appropriate package and download and unpack it, the VERY latest firmware needs to be created with mlxburn like this:

$ cd ~hjm/fw-25204-rel-1_2_940
$ sudo mlxburn -fw ./fw-25204-rel.mlx -dev /dev/mst/mt25204_pci_cr0 -nofs

## and it seems to work:
$ flint -d /dev/mst/mt25204_pciconf0 q
Image type: Failsafe
FW Version: 1.2.940 ---VERY latest fw.
I.S. Version: 1
Device ID: 25204
Description: Node / Port1 / Sys image
GUIDs: 00066a0098006e5f 00066a00a0006e5f 00066a0098006e5f
Board ID: j (MT_023002)
VSD: j
PSID: MT_023002
===

So with that, good luck.
Dan

From: gluster-users-boun...@gluster.org [mailto:gluster-users-boun...@gluster.org] On Behalf Of Harry Mangalam
Sent: Tuesday, November 08, 2011 11:45 AM
To: gluster-users@gluster.org
Subject: [Gluster-users] OT: possible to upgrade mellanox HBA firmware from ubuntu?

Sorry for the OT, but this problem is preventing my further testing of gluster and this group seems like it might know. I've been looking into this for a few days and have not run across any success stories in upgrading firmware using Ubuntu (Mellanox supports RH and SuSE); I've tried to install the MFT/MST packages, but it keeps erroring out for various reasons. If it's not possible (or very painful), I'll upgrade the card firmware via a LiveCD/DVD of SciLinux or CentOS on another machine. Other suggestions for upgrading the IB card firmware gratefully accepted.

Harry

-- Harry Mangalam - Research Computing, OIT, Rm 225 MSTB, UC Irvine [ZOT 2225] / 92697 Google Voice Multiplexer: (949) 478-4487 MSTB Lat/Long: (33.642025,-117.844414) (paste into Google Maps) -- This signature has been OCCUPIED!
-- Harry Mangalam - Research Computing, OIT, Rm 225 MSTB, UC Irvine [ZOT 2225] / 92697 Google Voice Multiplexer: (949) 478-4487 MSTB Lat/Long: (33.642025,-117.844414) (paste into Google Maps) -- This signature has been OCCUPIED! ___ Gluster-users mailing list Gluster-users@gluster.org http://gluster.org/cgi-bin/mailman/listinfo/gluster-users
[Gluster-users] OT: possible to upgrade mellanox HBA firmware from ubuntu?
Sorry for the OT, but this problem is preventing my further testing of gluster and this group seems like it might know. I've been looking into this for a few days and have not run across any success stories in upgrading firmware using Ubuntu (Mellanox supports RH and SuSE); I've tried to install the MFT/MST packages, but it keeps erroring out for various reasons. If it's not possible (or very painful), I'll upgrade the card firmware via a LiveCD/DVD of SciLinux or CentOS on another machine. Other suggestions for upgrading the IB card firmware gratefully accepted.

Harry

-- Harry Mangalam - Research Computing, OIT, Rm 225 MSTB, UC Irvine [ZOT 2225] / 92697 Google Voice Multiplexer: (949) 478-4487 MSTB Lat/Long: (33.642025,-117.844414) (paste into Google Maps) -- This signature has been OCCUPIED!
___ Gluster-users mailing list Gluster-users@gluster.org http://gluster.org/cgi-bin/mailman/listinfo/gluster-users