Cheers, Brad
Phil Carns wrote:
Ok, I think I _might_ see what the problem is with the BMI messaging. I haven't 100% confirmed it yet, but it looks like we have the following scenario:

On the client side:
--------------------
- pvfs2-client-core starts an I/O operation (write) to server X
- a send (for the request) is posted, which is a small buffer
- the flow is posted before an ack is received
- the flow itself posts another send for data, which is a large buffer
- ...

A quick note: I think the above is a performance optimization; we try to go ahead and get the flow going before receiving a positive ack from the server. It will be canceled if we get a negative ack (or fail to get an ack altogether).

- while the above is in progress, pvfs2-client-core starts another write operation to server X (from another application that is hitting the same server)
- a send for this second request is posted
- another flow is posted before an ack is received
- depending on the timing, it may manage to post a send for data as well, which is another large buffer
- this traffic is interleaved on the same socket as is being used for the first flow, which is still running at this point

On the server side:
--------------------
- the first I/O request arrives
- it gets past the request scheduler
- a flow is started and receives the first (large) data buffer
- a different request for the same handle arrives - getattr would be a good example, could be from any client
- this getattr gets queued in the request scheduler behind the write
- the second I/O request arrives
- it gets queued behind the getattr in the request scheduler

At this point on the server side, we have a flow in progress that is waiting on a data buffer. However, the next message is for a different flow (the tags don't match). Since this message is relatively large (256K), it is in rendezvous mode within bmi_tcp and cannot be pulled out of the socket until a matching receive is posted. The flow that is expected to post that receive is not running yet because the second I/O request is stuck in the scheduler... so we have a deadlock. The socket is filled with data that the server isn't allowed to recv yet, and the data that it really needs next is stuck behind it.

I'm not sure that I described that all that well. At a high level we have two flows sharing the same socket. The client started both of them and the messages got interleaved. The server only started one of them, but is now stuck because it can't deal with data arriving for the second one.

I am going to try to find a brute force way to serialize I/O from each pvfs2-client-core just to see if that solves the problem (maybe only allowing one buffer between pvfs2-client-core and the kernel, rather than 5). If that does look like it fixed the problem, then we need a more elegant solution. Maybe waiting for acks before starting flows, or just somehow serializing flows that share sockets.

-Phil
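To make the head-of-line blocking concrete, here is a tiny standalone C sketch. It is not bmi_tcp code; the tags, sizes, and eager/rendezvous cutoff are made up for illustration. It just shows why a large message whose tag has no posted receive stalls everything queued behind it on the same socket:

#include <stdio.h>

struct msg {
    int tag;      /* BMI tag (one per flow in this example)       */
    int size;     /* bytes; above the cutoff means rendezvous     */
};

#define EAGER_LIMIT (64 * 1024)   /* assumed cutoff, illustration only */

int main(void)
{
    /* Messages sitting in the socket in arrival order: the second flow's
     * data happens to be at the head, the first flow's data behind it. */
    struct msg socket_fifo[] = {
        { 2, 256 * 1024 },   /* data for flow 2 (server hasn't started it) */
        { 1, 256 * 1024 },   /* data the active flow 1 is waiting for      */
    };
    int posted_tag = 1;      /* the only receive the server has posted     */
    size_t i;

    for (i = 0; i < sizeof(socket_fifo) / sizeof(socket_fifo[0]); i++) {
        struct msg *m = &socket_fifo[i];
        if (m->size <= EAGER_LIMIT) {
            printf("tag %d: small, buffered eagerly\n", m->tag);
        } else if (m->tag == posted_tag) {
            printf("tag %d: rendezvous, receive posted, drained\n", m->tag);
        } else {
            printf("tag %d: rendezvous, no receive posted -> socket blocked,"
                   " tag %d stuck behind it\n", m->tag, posted_tag);
            break;           /* nothing behind this message can be read */
        }
    }
    return 0;
}

With both data messages larger than the eager cutoff, the loop stops at the first unmatched tag, which is exactly the stuck-socket situation described above.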
Phil Carns wrote:

Hi Murali,

I'm not seeing any problems with the dd scenario that you describe (I tried running 8-10 of them simultaneously), but I may just have different timing characteristics.

One thing that I neglected to mention about the application that I am running is that it actually has 2 processes per client. That may be something that I have in common with your test case (multiple applications writing concurrently from the same client). I will reconfigure and try some test runs with 1 process per client and see if that impacts behavior for my test case.

-Phil

Phil Carns wrote:

I reverted my src/io/bmi directory back to a snapshot of what was in CVS around March, but I still have the same problems. It doesn't seem to be impacted by the pipe changes or the socket buffer size changes.

I haven't tried your dd test case yet, but I will give that a shot soon.

-Phil

Murali Vilayannur wrote:

Hi Phil, others,

I think I am seeing the same problem (or some variants of it) with the NFSv4 bug that I reported a few days ago. Here is a very simple way that I am able to reproduce it reliably: mount a pvfs2 volume and run a few instances of

dd if=/dev/zero of=/mnt/pvfs2/file bs=2M count=100

All the dd instances simply wait for something to happen, client-core and the servers are also waiting for something to happen, and no process makes any progress. It is perhaps related to the pipe changes to bmi-tcp with poll? Let me revert those changes and try it out as well.

Thanks,
Murali

On Tue, 13 Jun 2006, Phil Carns wrote:

I went back and added some much more specific debugging messages and put some special prefixes on flow, bmi, and request processor messages so I could group them a little easier, and got rid of the extra mutexes. After running a few more tests and double checking the logs, this is what I am seeing in every failure case:

- it is a write flow
- it gets broken into two buffers within the flow protocol
- the first is 256K
- the second is 64K
- both sides _do_ get the request processing correct
- both sides post the correct BMI operations
- the server posts 2 receives
- the client posts 2 sends
- the second data message never shows up at the server

I must have missed part of the flow/request processing on the client side when I looked before. Sorry about the wild goose chase with the request process question. The request processing definitely looks correct in the logs that I am seeing now (or at least both the client and server agree on it anyway).

BMI has logging that shows when incoming messages appear on a socket, and it includes the tag number. The 64K message never shows up on a socket at the server side. The BMI operation therefore never completes, and the callback never gets triggered to complete the flow.

I don't know if this is a bmi_tcp problem, or if there is some sort of general network wackiness in my environment. I'll try to revert to some older versions of bmi_tcp tomorrow and see if the problem persists. I already tried just backing out changes to the poll notification in socket-collection.[ch] and (assuming I got that right) it didn't make any difference.

-Phil
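A side note on the buffer split above: assuming the flow's maximum buffer size is 256K (which the 256K + 64K split suggests, so a 320K transfer in this case), the chunking arithmetic looks like the following. This is only an illustration, not the flow protocol code:

#include <stdio.h>

int main(void)
{
    long remaining = 320 * 1024;          /* assumed total for this example */
    const long buffer_max = 256 * 1024;   /* assumed flow buffer size       */

    while (remaining > 0) {
        long chunk = remaining < buffer_max ? remaining : buffer_max;
        printf("post buffer of %ldK\n", chunk / 1024);  /* 256K, then 64K */
        remaining -= chunk;
    }
    return 0;
}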
Phil Carns wrote:

Hi Sam,

I may be wrong, but so far it looks like the problem is a little different this time. In the scenario that I am seeing now it doesn't look like the flow has finished everything and then just failed to mark completion - it looks like one side actually posted more BMI messages than the other. The server posted two receives of 256K, while the client only posted one send of 256K. I don't see any zero byte results from the request processor or any cases where the flow should have marked completion.

I'm still trying to get a better log. After adding mutexes in PINT_process_request() on the server side, the problem has gotten much harder to reproduce. I don't know if there is any correlation or if I am just perturbing the timing enough to hide the problem. So far out of 9 runs I have had only 1 failure, and the logs from that failure were inconclusive.

I definitely agree with both of your points about reworking the flow code; it is difficult to debug right now.

-Phil

Sam Lang wrote:

On Jun 12, 2006, at 8:34 AM, Phil Carns wrote:

Hi all,

I am looking at an I/O problem that I don't completely understand. The setup is that there are 15 servers and 20 clients (all RHEL3 SMP). The clients are running a proprietary application. At the end of the run they each write their share of a data set into a 36 GB file, so each is doing a contiguous write of 36/20 GB of data.

Roughly 80-90% of the time, one or more of the clients and servers hang, though it isn't predictable which one will do it. After quite a bit of digging through logs, it turns out that there is a flow that never completes on the server side. There are two problems contributing here:

1) The flow timeout mechanism doesn't seem to be working right for me (the timer for the flow pops and checks progress one time, then never gets triggered again). This isn't the core problem, but it is causing the flow to get stuck forever and hold up the request scheduler rather than giving up eventually.

2) For this particular flow, the client and server appear to disagree on how much data to exchange. The client sends one 256K buffer and completes the flow. The server receives that 256K buffer but still expects at least one more to arrive (which never does).

Hi Phil, these problems sound very similar to the bug I was seeing about a month ago with the flow code for certain request types. The particular case I was looking at was with reads, so given that there's a lot of code duplication in flow I could imagine the same problem existing (but not getting fixed) for writes. It may not be the same as the problem you're seeing, but I'll walk through the bug and how I fixed it, in case it helps.

Basically, the request was ending right at the boundary of a flow buffer (so bytes == bytemax). This caused the request processing code to finish and bmi_send_callback_fn would return, but it would get called again by the post code for the next buffer, except this time the request processing would return 0 bytes, and this queue item would get marked as the last one. It was that last queue item that was the problem: since the trove_bstream_read_list was never getting posted, the queue item whose last field was set was never making it into the trove_read_callback_fn, so the dest_last_posted field for the flow data wasn't getting set either.

The fix was to add some extra checking in bmi_send_callback_fn. In the case that the request processing returns 0 bytes processed, I can set the dest_last_posted field to 1 immediately. The only caveat to this is the case where other queue items in the source list haven't made it to the dest list yet, so we need to check for that case. The diff of the changes I made might help:

http://www.pvfs.org/cgi-bin/pvfs2/viewcvs/viewcvs.cgi/pvfs2/src/io/flow/flowproto-bmi-trove/flowproto-multiqueue.c.diff?r1=1.105&r2=1.106
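A rough sketch of the shape of that extra check, with simplified stand-in structures (the real change is in the linked flowproto-multiqueue.c diff; the field names follow the description above but the struct, function, and its arguments here are placeholders, not the actual flow code):

#include <assert.h>

struct flow_data {
    int dest_last_posted;     /* set once the last dest queue item is posted   */
    int src_list_empty;       /* 1 if every source queue item has already      */
                              /* moved over to the dest list                   */
};

/* Called from the send callback with the byte count the request
 * processing returned for this buffer. */
static void handle_processed_bytes(struct flow_data *fd, long bytes)
{
    if (bytes == 0) {
        /* The request ended exactly on the previous buffer boundary, so
         * this queue item carries no data.  Nothing will ever be posted
         * for it, which means the usual completion path (the trove
         * callback) will never mark dest_last_posted.  Do it here instead,
         * but only if no earlier queue items are still waiting to move to
         * the dest list; otherwise they still own that bookkeeping. */
        if (fd->src_list_empty) {
            fd->dest_last_posted = 1;
        }
    }
    /* bytes > 0: normal path, completion is marked by the later callbacks. */
}

int main(void)
{
    struct flow_data fd = { 0, 1 };
    handle_processed_bytes(&fd, 0);
    assert(fd.dest_last_posted == 1);
    return 0;
}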
In any case, I wasn't able to make use of the debug messages for the request processing code; it was too verbose to make any sense of it. I think I ended up just adding debug statements for the entry and exit points of the flow callbacks along with all the different side-state fields as they were changed. I'm not sure those debug statements made it into the trunk though. :-(

Since we're here, I think the flow code could benefit from some reworking in two ways:

1. Fewer side-state variables to be set. Each queue item has fields that change the behavior of when things should complete. It's hard to keep track of all that.

2. More code abstraction. We end up doing very similar things between reads and writes on the server (and on the client too), especially with respect to the request processing. It might make sense to abstract out some of that code duplication into separate functions.

-sam

I am pretty sure I can get to the bottom of the first problem. The second one is trickier. I attached a few log file excerpts:

client-1st.txt: the request processing log output from the client side when it is working on the flow in question
server-1st.txt: request processing log output from the first iteration of the server side flow
server-2nd.txt: request processing log output from the second iteration of the server side flow

Anyone have any suggestions, or see any indication from those log portions about why the server and client aren't on the same page? Some of this log output (the server-2nd.txt in particular?) looks a little jumbled, like maybe there are two flows doing request processing at once, but I don't really know how to interpret it. I may try adding a mutex around PINT_process_request() next, just to see if I can make the logs clearer by making sure only one invocation is going at once.

-Phil
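For what it's worth, the debugging serialization mentioned above can be as simple as one global mutex around the call. A minimal sketch, with a placeholder function standing in for PINT_process_request() (the real prototype takes different arguments); compile with -lpthread:

#include <pthread.h>
#include <stdio.h>

static pthread_mutex_t request_debug_mutex = PTHREAD_MUTEX_INITIALIZER;

/* Stand-in for the real request-processing entry point. */
static int process_request_placeholder(int flow_id)
{
    printf("processing request for flow %d\n", flow_id);
    return 0;
}

/* Debug-only wrapper: one global lock so only one invocation runs at a
 * time and log output from different flows can't interleave. */
static int serialized_process_request(int flow_id)
{
    int ret;
    pthread_mutex_lock(&request_debug_mutex);
    ret = process_request_placeholder(flow_id);
    pthread_mutex_unlock(&request_debug_mutex);
    return ret;
}

int main(void)
{
    serialized_process_request(1);
    serialized_process_request(2);
    return 0;
}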
_______________________________________________ Pvfs2-developers mailing list Pvfs2-developers@beowulf-underground.org http://www.beowulf-underground.org/mailman/listinfo/pvfs2-developers