One other observation is that the failures seem to be genuinely related
to the number of separate nodes involved, rather than to the total
number of script instances.

If I run, say, 50 instances of my script using 50 separate nodes, then
they almost always generate some failures.

If I run the same number of instances, or even a much greater number,
but spread over only 10 separate nodes, then they always seem to work OK.

Maybe this is due to some kind of caching behaviour?
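
For concreteness, spreading the same 50 instances over fewer nodes looks
roughly like the sketch below (the host-list file names are just
placeholders for my node lists):

   # Sketch: launch 50 instances, assigned round-robin to the listed nodes.
   nodes=( $(cat nodes_10.txt) )   # swap in nodes_50.txt for the 50-node case
   for i in $(seq 0 49); do
       ssh "${nodes[$(( i % ${#nodes[@]} ))]}" script &
   done
   wait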

.. Lana (lana.de...@gmail.com)






On Mon, Dec 6, 2010 at 11:05 AM, Lana Deere <lana.de...@gmail.com> wrote:
> The Gluster configuration is distribute, with 4 server nodes.
>
> There are 53 physical client nodes in my setup, each with 8 cores; we
> sometimes want to run more than 400 client processes simultaneously.
> In practice we aren't yet trying that many.
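>
> (For reference, the volume is a plain distribute volume over RDMA; it was
> created with something along these lines -- the server names here are
> placeholders, while the volume name and the /data brick path are taken
> from the logs:)
>
>    gluster volume create RaidData transport rdma \
>        server1:/data server2:/data server3:/data server4:/data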
>
> When I run the commands which break, I am running them on separate
> clients simultaneously.
>    for host in <hosts>; do ssh $host script& done  # Note the &
> So far, when I run on 25 clients simultaneously, I have not seen it
> fail.  But if I run on 40 or 50 simultaneously, it often fails.
>
> Sometimes I have run more than one command on each client
> simultaneously by listing all the hosts multiple times in the
> for-loop,
>   for host in <hosts> <hosts> <hosts>; do ssh $host script& done
> In the example of 3 at a time, I have noticed that when a host works,
> all three runs on that client will work; but when it fails, all three
> will fail in exactly the same fashion.
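>
> (For reference, one way to capture each run's output so the checksums can
> be compared afterwards -- the results directory here is just a placeholder:)
>
>    # Sketch: save each concurrent run's checksum, then count distinct results.
>    mkdir -p /tmp/gluster-test
>    n=0
>    for host in <hosts> <hosts> <hosts>; do
>        n=$((n + 1))
>        ssh $host script > /tmp/gluster-test/$host.$n.out &
>    done
>    wait
>    sort /tmp/gluster-test/*.out | uniq -c   # a single line if every run agreed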
>
> I've attached a tarfile containing two sets of logs.  In both cases I
> had rotated all the log files and rebooted everything, then run my
> test.  In the first set of logs, I went directly to approx. 50
> simultaneous sessions, and pretty much all of them just hung.  (When
> the find hangs, even a kill -9 will not unhang it.)  So I rotated the
> logs again and rebooted everything, but this time I gradually worked
> my way up to higher loads.  This time the failures were mostly cases
> with the wrong checksum but no error message, though some of them did
> give me errors like
>    find: lib/kbd/unimaps/cp865.uni: Invalid argument
>
> Thanks.  I may try downgrading to 3.1.0 just to see if I have the same
> problem there.
>
>
> .. Lana (lana.de...@gmail.com)
>
>
>
>
>
>
> On Mon, Dec 6, 2010 at 12:30 AM, Raghavendra G <raghaven...@gluster.com> wrote:
>> Hi Lana,
>>
>> I need some clarifications about test setup:
>>
>> * Are you seeing the problem when there are more than 25 clients? If this is
>> the case, are these clients on different physical nodes, or does more than
>> one client share the same node? In other words, on how many physical nodes
>> are the clients mounted in your test setup? Also, are you running the
>> command on each of these clients simultaneously?
>>
>> * Or is it that there are more than 25 concurrent invocations of the script?
>> If this is the case, how many clients are present in your test setup, and on
>> how many physical nodes are these clients mounted?
>>
>> regards,
>> ----- Original Message -----
>> From: "Lana Deere" <lana.de...@gmail.com>
>> To: gluster-users@gluster.org
>> Sent: Saturday, December 4, 2010 12:13:30 AM
>> Subject: [Gluster-users] 3.1.1 crashing under moderate load
>>
>> I'm running GlusterFS 3.1.1, CentOS5.5 servers, CentOS5.4 clients, RDMA
>> transport, native/fuse access.
>>
>> I have a directory which is shared on the Gluster volume.  In fact, it is a
>> clone of /lib from one of the clients, shared so all the clients can see it.
>>
>> I have a script which does
>>    find lib -type f -print0 | xargs -0 sum | md5sum
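>>
>> (The "script" itself is just a thin wrapper around that pipeline -- roughly
>> the sketch below, where /mnt/glusterfs stands in for whatever mount point
>> the clients actually use:)
>>
>>    #!/bin/sh
>>    # Sketch of the per-client checksum script; "lib" is the cloned /lib
>>    # directory at the top of the Gluster mount (mount point is illustrative).
>>    cd /mnt/glusterfs || exit 1
>>    find lib -type f -print0 | xargs -0 sum | md5sum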
>>
>> If I run this on my clients one at a time, they all yield the same md5sum:
>>    for host in <<hosts>>; do ssh $host script; done
>>
>> If I run this on my clients concurrently, up to roughly 25 at a time, they
>> still yield the same md5sum:
>>    for host in <<hosts>>; do ssh $host script& done
>>
>> Beyond that, the Gluster share often, but not always, fails.  The errors vary.
>>    - sometimes I get "sum: xxx.so not found"
>>    - sometimes I get the wrong checksum without any error message
>>    - sometimes the job simply hangs until I kill it
>>
>>
>> Some of the server logs show messages like these from the time of the
>> failures (other servers show nothing from around that time):
>>
>> [2010-12-03 10:03:06.34328] E [rdma.c:4442:rdma_event_handler]
>> rpc-transport/rdma: rdma.RaidData-server: pollin received on tcp
>> socket (peer: 10.54.255.240:1022) after handshake is complete
>> [2010-12-03 10:03:06.34363] E [rpcsvc.c:1548:rpcsvc_submit_generic]
>> rpc-service: failed to submit message (XID: 0x55e82, Program:
>> GlusterFS-3.1.0, ProgVers: 310, Proc: 12) to rpc-transport
>> (rdma.RaidData-server)
>> [2010-12-03 10:03:06.34377] E [server.c:137:server_submit_reply] :
>> Reply submission failed
>> [2010-12-03 10:03:06.34464] E [rpcsvc.c:1548:rpcsvc_submit_generic]
>> rpc-service: failed to submit message (XID: 0x55e83, Program:
>> GlusterFS-3.1.0, ProgVers: 310, Proc: 12) to rpc-transport
>> (rdma.RaidData-server)
>> [2010-12-03 10:03:06.34520] E [server.c:137:server_submit_reply] :
>> Reply submission failed
>>
>>
>> On a client which had a failure I see messages like:
>>
>> [2010-12-03 10:03:06.21290] E [rdma.c:4442:rdma_event_handler]
>> rpc-transport/rdma: RaidData-client-1: pollin received on tcp socket
>> (peer: 10.54.50.101:24009) after handshake is complete
>> [2010-12-03 10:03:06.21776] E [rpc-clnt.c:338:saved_frames_unwind]
>> (-->/usr/lib64/libgfrpc.so.0(rpc_clnt_notify+0xb9) [0x3814a0f769]
>> (-->/usr/lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x7e)
>> [0x3814a0ef1e] (-->/usr/lib64/libgfrpc.so.0(saved_frames_destroy+0xe)
>> [0x3814a0ee8e]))) rpc-clnt: forced unwinding frame type(GlusterFS 3.1)
>> op(READ(12)) called at 2010-12-03 10:03:06.20492
>> [2010-12-03 10:03:06.21821] E [rpc-clnt.c:338:saved_frames_unwind]
>> (-->/usr/lib64/libgfrpc.so.0(rpc_clnt_notify+0xb9) [0x3814a0f769]
>> (-->/usr/lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x7e)
>> [0x3814a0ef1e] (-->/usr/lib64/libgfrpc.so.0(saved_frames_destroy+0xe)
>> [0x3814a0ee8e]))) rpc-clnt: forced unwinding frame type(GlusterFS 3.1)
>> op(READ(12)) called at 2010-12-03 10:03:06.20529
>> [2010-12-03 10:03:06.26827] I
>> [client-handshake.c:993:select_server_supported_programs]
>> RaidData-client-1: Using Program GlusterFS-3.1.0, Num (1298437),
>> Version (310)
>> [2010-12-03 10:03:06.27029] I
>> [client-handshake.c:829:client_setvolume_cbk] RaidData-client-1:
>> Connected to 10.54.50.101:24009, attached to remote volume '/data'.
>> [2010-12-03 10:03:06.27067] I
>> [client-handshake.c:698:client_post_handshake] RaidData-client-1: 2
>> fds open - Delaying child_up until they are re-opened
>>
>>
>> Anyone else seen anything like this and/or have suggestions about options I
>> can set to work around this?
>>
>>
>> .. Lana (lana.de...@gmail.com)
>>
>
_______________________________________________
Gluster-users mailing list
Gluster-users@gluster.org
http://gluster.org/cgi-bin/mailman/listinfo/gluster-users
