The default pvfs2 configuration files (and matching #define's in pvfs2-server.h) list the following job timeout values:

ServerJobBMITimeoutSecs 30
ServerJobFlowTimeoutSecs 30
ClientJobBMITimeoutSecs 300
ClientJobFlowTimeoutSecs 300

The flow timeouts trigger when X seconds have passed without moving any data, while the BMI timeouts trigger if the entire operation hasn't completed in X seconds.
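To make the difference between the two timeout styles concrete, here is a minimal sketch. It is not PVFS2 code: the function names and the timestamps are hypothetical, and it only models the two rules as described above (inactivity versus total elapsed time).

```python
def flow_timed_out(last_progress, now, timeout_secs):
    """Inactivity rule: fires when no data has moved for timeout_secs."""
    return now - last_progress >= timeout_secs


def bmi_timed_out(started, now, timeout_secs):
    """Whole-operation rule: fires when the operation has run for timeout_secs."""
    return now - started >= timeout_secs


# Hypothetical transfer: started at t=0, last moved data at t=40, checked at t=50.
started, last_progress, now = 0, 40, 50
print(flow_timed_out(last_progress, now, 30))  # False: only 10 s without progress
print(bmi_timed_out(started, now, 30))         # True: 50 s since the start
```

Under the inactivity rule a slow but steadily progressing transfer survives, while a single stalled chunk that moves no data for the full timeout does not.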

This means that on the server side, each individual Trove write operation (broken into 256K chunks) has to complete within 30 seconds, or else it will cause a write flow timeout to trigger and the flow will be cancelled (to be restarted by the client).

We have recently found some test scenarios where 30 seconds isn't really long enough, in particular with the following combination:

- fast server with a lot of RAM
- relatively high latency storage (old SAN hardware)
- very heavy write workload

If the system isn't tuned at all (more on that in a later email), then what happens is the server cooks along accepting writes into its buffer cache until the buffer cache is practically exhausted. At that point it tries to flush an enormous amount of data to the storage device, which has to hop over the HBA, switch, controller, RAID, etc. to get that data out. During this time newly posted writes will take a long time to complete.

The end result is that, even with standalone benchmarks, we occasionally see writes take as long as 50 seconds to finish, despite the fact that they are only 256K in size.

Most likely all writes during this buffer flush time will take a while, regardless of the API used. It is worth noting, though, that the glibc AIO implementation queues all I/O to a given file descriptor to be serviced sequentially in a single thread dedicated to that fd. If you have many clients writing to the same file, then you will probably end up with N delayed writes rather than just one, and timeout/cancellation scenarios with several clients.
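The queueing effect can be illustrated with a toy model. This is only a sketch under stated assumptions: all writes are posted at the same instant, the per-fd queue services them strictly one at a time, and each write's timeout clock starts when it is posted. The function names and the 20-second service time are made up for illustration.

```python
from itertools import accumulate


def completion_times(service_secs):
    """Completion time of each write when the queue is drained sequentially."""
    return list(accumulate(service_secs))


def count_timed_out(service_secs, timeout_secs):
    """Number of queued writes that finish at or after the timeout."""
    return sum(1 for t in completion_times(service_secs) if t >= timeout_secs)


# Hypothetical case: four clients each post one write, each taking 20 s to service.
writes = [20, 20, 20, 20]
print(completion_times(writes))      # cumulative finish times per write
print(count_timed_out(writes, 30))   # with a 30 s timeout
print(count_timed_out(writes, 300))  # with a 300 s timeout
```

In this model no single write is slow enough to hit a 30-second timeout on its own, but every write queued behind the first one inherits the accumulated delay.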

I think we are going to run with the two ServerJob timeouts set to 300 seconds (as is already done for the client), but I just wanted to pass along the information in case there is interest in changing the stock default values.
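For reference, the settings we plan to run with would look roughly like this. It uses only the option names listed at the top of this email; where these lines sit in your own configuration file is not shown here.

```
ServerJobBMITimeoutSecs 300
ServerJobFlowTimeoutSecs 300
ClientJobBMITimeoutSecs 300
ClientJobFlowTimeoutSecs 300
```

The client values are unchanged from the defaults; only the two ServerJob values move from 30 to 300.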

-Phil

_______________________________________________
Pvfs2-developers mailing list
Pvfs2-developers@beowulf-underground.org
http://www.beowulf-underground.org/mailman/listinfo/pvfs2-developers
