One other observation is that it seems to be genuinely related to the number of nodes involved.
If I run, say, 50 instances of my script using 50 separate nodes, then
they almost always generate some failures. If I run the same number of
instances, or even a much greater number, but using only 10 separate
nodes, then they seem always to work OK. Maybe this is due to some kind
of caching behaviour? (See the reproduction sketches at the end of this
message.)

.. Lana (lana.de...@gmail.com)


On Mon, Dec 6, 2010 at 11:05 AM, Lana Deere <lana.de...@gmail.com> wrote:
> The gluster configuration is distribute; there are 4 server nodes.
>
> There are 53 physical client nodes in my setup, each with 8 cores; we
> sometimes want to run more than 400 client processes simultaneously.
> In practice we aren't yet trying that many.
>
> When I run the commands which break, I am running them on separate
> clients simultaneously:
>
>     for host in <hosts>; do ssh $host script & done    # Note the &
>
> When I run on 25 clients simultaneously, so far I have not seen it
> fail. But if I run on 40 or 50 simultaneously, it often fails.
>
> Sometimes I have run more than one command on each client
> simultaneously by listing all the hosts multiple times in the
> for-loop:
>
>     for host in <hosts> <hosts> <hosts>; do ssh $host script & done
>
> In the 3-at-a-time example, I have noticed that when a host works,
> all three on that client will work; but when it fails, all three fail
> in exactly the same fashion.
>
> I've attached a tarfile containing two sets of logs. In both cases I
> had rotated all the log files and rebooted everything, then run my
> test. In the first set of logs, I went directly to approx. 50
> simultaneous sessions, and pretty much all of them just hung. (When
> the find hangs, even a kill -9 will not unhang it.) So I rotated the
> logs again and rebooted everything, but this time I gradually worked
> my way up to higher loads. This time the failures were mostly cases
> with the wrong checksum but no error message, though some of them did
> give me errors like:
>
>     find: lib/kbd/unimaps/cp865.uni: Invalid argument
>
> Thanks. I may try downgrading to 3.1.0 just to see if I have the same
> problem there.
>
>
> .. Lana (lana.de...@gmail.com)
>
>
> On Mon, Dec 6, 2010 at 12:30 AM, Raghavendra G <raghaven...@gluster.com> wrote:
>> Hi Lana,
>>
>> I need some clarifications about the test setup:
>>
>> * Are you seeing the problem when there are more than 25 clients? If
>> this is the case, are these clients on different physical nodes, or
>> does more than one client share the same node? In other words, on
>> how many physical nodes in your test setup are the clients mounted?
>> Also, are you running the command on each of these clients
>> simultaneously?
>>
>> * Or is it that there are more than 25 concurrent invocations of the
>> script? If this is the case, how many clients are present in your
>> test setup, and on how many physical nodes are these clients mounted?
>>
>> regards,
>>
>> ----- Original Message -----
>> From: "Lana Deere" <lana.de...@gmail.com>
>> To: gluster-users@gluster.org
>> Sent: Saturday, December 4, 2010 12:13:30 AM
>> Subject: [Gluster-users] 3.1.1 crashing under moderate load
>>
>> I'm running GlusterFS 3.1.1, CentOS 5.5 servers, CentOS 5.4 clients,
>> RDMA transport, native/fuse access.
>>
>> I have a directory which is shared on the gluster. In fact, it is a
>> clone of /lib from one of the clients, shared so all can see it.
>>
>> I have a script which does:
>>
>>     find lib -type f -print0 | xargs -0 sum | md5sum
>>
>> If I run this on my clients one at a time, they all yield the same
>> md5sum:
>>
>>     for host in <hosts>; do ssh $host script; done
>>
>> If I run this on my clients concurrently, up to roughly 25 at a time,
>> they still yield the same md5sum:
>>
>>     for host in <hosts>; do ssh $host script & done
>>
>> Beyond that, the gluster share often, but not always, fails. The
>> errors vary:
>> - sometimes I get "sum: xxx.so not found"
>> - sometimes I get the wrong checksum without any error message
>> - sometimes the job simply hangs until I kill it
>>
>> Some of the server logs show messages like these from the time of the
>> failures (other servers show nothing from around that time):
>>
>> [2010-12-03 10:03:06.34328] E [rdma.c:4442:rdma_event_handler]
>> rpc-transport/rdma: rdma.RaidData-server: pollin received on tcp
>> socket (peer: 10.54.255.240:1022) after handshake is complete
>> [2010-12-03 10:03:06.34363] E [rpcsvc.c:1548:rpcsvc_submit_generic]
>> rpc-service: failed to submit message (XID: 0x55e82, Program:
>> GlusterFS-3.1.0, ProgVers: 310, Proc: 12) to rpc-transport
>> (rdma.RaidData-server)
>> [2010-12-03 10:03:06.34377] E [server.c:137:server_submit_reply] :
>> Reply submission failed
>> [2010-12-03 10:03:06.34464] E [rpcsvc.c:1548:rpcsvc_submit_generic]
>> rpc-service: failed to submit message (XID: 0x55e83, Program:
>> GlusterFS-3.1.0, ProgVers: 310, Proc: 12) to rpc-transport
>> (rdma.RaidData-server)
>> [2010-12-03 10:03:06.34520] E [server.c:137:server_submit_reply] :
>> Reply submission failed
>>
>> On a client which had a failure I see messages like:
>>
>> [2010-12-03 10:03:06.21290] E [rdma.c:4442:rdma_event_handler]
>> rpc-transport/rdma: RaidData-client-1: pollin received on tcp socket
>> (peer: 10.54.50.101:24009) after handshake is complete
>> [2010-12-03 10:03:06.21776] E [rpc-clnt.c:338:saved_frames_unwind]
>> (-->/usr/lib64/libgfrpc.so.0(rpc_clnt_notify+0xb9) [0x3814a0f769]
>> (-->/usr/lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x7e)
>> [0x3814a0ef1e] (-->/usr/lib64/libgfrpc.so.0(saved_frames_destroy+0xe)
>> [0x3814a0ee8e]))) rpc-clnt: forced unwinding frame type(GlusterFS 3.1)
>> op(READ(12)) called at 2010-12-03 10:03:06.20492
>> [2010-12-03 10:03:06.21821] E [rpc-clnt.c:338:saved_frames_unwind]
>> (-->/usr/lib64/libgfrpc.so.0(rpc_clnt_notify+0xb9) [0x3814a0f769]
>> (-->/usr/lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x7e)
>> [0x3814a0ef1e] (-->/usr/lib64/libgfrpc.so.0(saved_frames_destroy+0xe)
>> [0x3814a0ee8e]))) rpc-clnt: forced unwinding frame type(GlusterFS 3.1)
>> op(READ(12)) called at 2010-12-03 10:03:06.20529
>> [2010-12-03 10:03:06.26827] I
>> [client-handshake.c:993:select_server_supported_programs]
>> RaidData-client-1: Using Program GlusterFS-3.1.0, Num (1298437),
>> Version (310)
>> [2010-12-03 10:03:06.27029] I
>> [client-handshake.c:829:client_setvolume_cbk] RaidData-client-1:
>> Connected to 10.54.50.101:24009, attached to remote volume '/data'.
>> [2010-12-03 10:03:06.27067] I
>> [client-handshake.c:698:client_post_handshake] RaidData-client-1: 2
>> fds open - Delaying child_up until they are re-opened
>>
>> Anyone else seen anything like this and/or have suggestions about
>> options I can set to work around this?
>>
>> .. Lana (lana.de...@gmail.com)
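
For anyone trying to reproduce this: below is a minimal, self-contained
sketch of the kind of harness described in the thread. The host list
file hosts.txt, the mount point /mnt/gluster, and passwordless ssh from
the driving node to every client are assumptions for illustration; the
checksum pipeline itself is the one quoted above.

    #!/bin/bash
    # repro.sh - fan the checksum pipeline out over many clients in
    # parallel and report whether every client computed the same md5sum.
    HOSTS_FILE=${1:-hosts.txt}   # one client hostname per line (assumed)
    DIR=${2:-/mnt/gluster}       # parent of the shared 'lib' clone (assumed)
    OUT=$(mktemp -d)

    # One background checksum job per host (note the &, as in the thread).
    while read -r host; do
        ssh "$host" "cd $DIR && find lib -type f -print0 | xargs -0 sum | md5sum" \
            >"$OUT/$host.sum" 2>"$OUT/$host.err" &
    done <"$HOSTS_FILE"
    wait

    # A healthy run shows exactly one distinct checksum across all hosts.
    echo "Distinct checksums seen (count, checksum):"
    sort "$OUT"/*.sum | uniq -c

    # Non-empty stderr usually means a "sum: ... not found" style failure.
    grep . "$OUT"/*.err && echo "Some hosts reported errors."

Per the thread, running this against 25 hosts should print a single
distinct checksum; at 40-50 hosts the counts start to diverge, errors
appear, or individual jobs hang.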
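
To exercise the node-count observation at the top of this message, one
could hold the total number of instances fixed while varying how many
distinct nodes they are spread over. Again a sketch under the same
assumptions; TOTAL, NODES, and 'script' (the checksum pipeline wrapped
as an executable on each client) are illustrative:

    #!/bin/bash
    # node_count_test.sh - run TOTAL concurrent instances of the checksum
    # script, spread round-robin over only the first NODES distinct hosts.
    TOTAL=${1:-50}    # total concurrent script instances
    NODES=${2:-10}    # how many distinct physical nodes to use

    # bash 3 compatible (CentOS 5 ships bash 3.2): first NODES hosts.
    hosts=($(head -n "$NODES" hosts.txt))

    for ((i = 0; i < TOTAL; i++)); do
        host=${hosts[$((i % NODES))]}
        ssh "$host" script &    # 'script' = the checksum pipeline above
    done
    wait

If the failures really track the number of distinct nodes rather than
the total process count, TOTAL=50 NODES=50 should reproduce them while
TOTAL=50 NODES=10 should pass, matching the observation above.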