Re: [Pvfs2-developers] tuning kernel buffer settings
Hi Phil, The attached patch fixes the read buffer bug that you had mentioned and also implements the variable-sized buffer counts and lengths that we can pass as command line options to pvfs2-client-core. I did not implement module load-time options for the buffer size settings since that is fairly complicated and not intuitive (client core driving the buffer size and count settings seems to make more sense to me). So now we can do pvfs2-client --desc-count=<count> --desc-size=<size> in addition to the usual options. As for the changes themselves, this involved modifying the parameters of an existing ioctl, so we break binary compatibility, but I don't think we have a policy of maintaining backward binary compatibility, do we? I have updated the compat ioctl code as well, so hopefully we won't break in mixed 32/64-bit environments. I have tested this out with various buffer sizes and counts on 32-bit platforms only! That said, I haven't done comprehensive testing, so there may still be bugs. Please review it and let me know if this looks ok. BTW: the patch is against pvfs-2.6.0; sorry about that. CVS ports are firewalled off at work and my internet at home is temporarily not working. thanks, Murali

On 11/29/06, Murali Vilayannur <[EMAIL PROTECTED]> wrote: Hi Phil, Thanks for running these tests. I think this buffer size will be dependent on the machine configuration, right? If we work out a simple formula for the buffer size based on, say, memory b/w (and/or latency) and network b/w (and/or latency), we could plug that in as a sane default. I did not realize that this setting would have such a noticeable effect on performance. I can work on a patch to change these settings at runtime. thanks, Murali
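To make the --desc-count/--desc-size mechanism concrete, here is a rough user-space sketch of how a client core could hand the buffer geometry to the kernel module through an ioctl. The struct layout, ioctl request number, and device path below are illustrative placeholders, not the contents of the actual patch:

#include <stdint.h>
#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>

/* Hypothetical parameter block -- the real patch changes the parameters
 * of an existing ioctl, which is why it breaks binary compatibility. */
struct buf_desc_params {
    uint32_t desc_count;   /* number of shared buffers (--desc-count) */
    uint32_t desc_size;    /* size of each buffer in bytes (--desc-size) */
};

/* Placeholder request number; the real definition would live in
 * pint-dev-shared.h. */
#define DEV_SET_BUFMAP_PARAMS _IOW('k', 1, struct buf_desc_params)

static int set_bufmap_params(const char *devpath, uint32_t count, uint32_t size)
{
    struct buf_desc_params p = { .desc_count = count, .desc_size = size };
    int fd = open(devpath, O_RDWR);

    if (fd < 0) {
        perror("open");
        return -1;
    }
    if (ioctl(fd, DEV_SET_BUFMAP_PARAMS, &p) < 0) {
        perror("ioctl");
        close(fd);
        return -1;
    }
    close(fd);
    return 0;
}

int main(void)
{
    /* e.g. the 2 x 32 MB configuration from Phil's tests; device path assumed */
    return set_bufmap_params("/dev/pvfs2-req", 2, 32 * 1024 * 1024);
}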
Re: [Pvfs2-developers] tuning kernel buffer settings
Hi Phil, Thanks for running these tests. I think this buffer size will be dependent on the machine configuration, right? If we work out a simple formula for the buffer size based on, say, memory b/w (and/or latency) and network b/w (and/or latency), we could plug that in as a sane default. I did not realize that this setting would have such a noticeable effect on performance. I can work on a patch to change these settings at runtime. thanks, Murali
Re: [Pvfs2-developers] read buffer bug
Hi guys, I am really sorry about this. I am surprised we did not catch this earlier. This was basically introduced by the file.c/bufmap.c cleanups that I had done a while back. Attached patch should fix this error. thanks for the testcase, Phil! Murali On 11/29/06, Phil Carns <[EMAIL PROTECTED]> wrote: I ran into a problem today with the 2.6.0 release. This happened to show up in the read04 LTP test, but not reliably. I have attached a test program that I think does trigger it reliably, though. When run on ext3: /home/pcarns> ./testme /tmp/foo.txt read returned: 7, test_buf: hello world When run on pvfs2: /home/pcarns> ./testme /mnt/pvfs2/foo.txt read returned: 7, test_buf: hello (or sometimes you might get garbage after the "hello") The test program creates a string buffer with "goodbye world" stored in it. It then reads the string "hello " out of a file into the beginning of that buffer. The result should be that the final resulting string is "hello world". The trick that makes this fail is asking to read more than 7 bytes from the file. In this particular test program, we attempt to do a read of 255 bytes. There are only 7 bytes in the file, though. The return code from read accurately reflects this. However, rather than just fill in the first 7 bytes of the buffer, it looks like PVFS2 is overwriting the full 255 bytes. What ends up in those trailing 248 bytes is somewhat random. I suspect that somewhere in the kernel module there is a copy_to_user() call that is copying the number of bytes requested by the read rather than the number of bytes returned by the servers. -Phil
Re: [Pvfs2-developers] TroveSyncData settings
On Nov 29, 2006, at 3:44 PM, Rob Ross wrote: That's what I was thinking -- that we could ask the I/O thread to do the syncing rather than stalling out other progress. Wanna try it and see if it helps :)? Rob Phil Carns wrote: No. Both alt aio and the normal dbpf method sync as a separate step after the aio list operation completes. This is technically possible with alt aio, though - you would just need to pass a flag through to tell the I/O thread to sync after the pwrite(). That would probably be pretty helpful, so the trove worker thread doesn't get stuck waiting on the sync... -Phil Rob Ross wrote: This is similar to using O_DIRECT, which has also shown benefits. With alt aio, do we sync in the context of the I/O thread? Thanks, Rob Phil Carns wrote: One thing that we noticed while testing for storage challenge was that (and everyone correct me if I'm wrong here) enabling the data-sync causes a flush/sync to occur after every sizeof(FlowBuffer) bytes had been written. I can imagine how this would help a SAN, but I'm perplexed how it helps localdisk, what buffer size are you playing with? We found that unless we were using HUGE (~size of cache on storage controller) flowbuffers that this caused way too many syncs/seeks on the disks and hurt performance quite a bit, maybe even as bad as 50% performance because things were not being optimized for our disk subsystems and we were issuing many small ops instead of fewer large ones. Granted I haven't been able to get 2.6.0 building properly yet to test the latest out, but this was definitely the case for us on the 2.5 releases. You are definitely right about the data sync option causing a flush/sync on every sizeof(FlowBuffer). I had a note that we should change the default aio data-sync code to only sync at the end of an IO request, instead of for each trove operation (in FlowBufferSize chunks). Doing this at the end of an io.sm seemed a little messy, but if/when we have request ids (hints) being passed to the trove interface, we could use that as a way to know to flush at the end. In any case, it sounds like it's better to flush early and often than at the end of a request? From a user perspective, we usually tell people to enable data sync if they're concerned about losing data. Now we're talking about getting better performance with data sync enabled (at least in some cases). Does it make sense to sync even with data sync disabled if we can figure out that better performance would result? -sam I don't really have a good explanation for why this doesn't seem to burn us anymore on local disk. Our settings are standard, except for: - 512KB flow buffer size - alt aio method - 512KB tcp buffers (with larger /proc tcp settings) This testing was done on some version prior to 2.6.0 also (I think it was a merge of some in-between release, so it is hard to pin down a version number). It may also have something to do with the controller and local disks being used? All of our local disk configurations are actually hardware raid 5 with some variety of the megaraid controller, and these are fairly new boxes. -Phil
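Sam's "sync at the end of an IO request" idea could look roughly like the sketch below once a request id hint is available to trove. All of the names here are invented for illustration (this is not actual trove or io.sm code): the completion path defers the fdatasync() until the last trove operation of the request finishes, instead of syncing once per flow buffer.

#include <pthread.h>
#include <stdint.h>
#include <unistd.h>

/* Invented per-request bookkeeping keyed by a request id hint. */
struct req_sync_state {
    uint64_t request_id;
    int fd;
    int ops_outstanding;        /* trove write ops still in flight for this request */
    pthread_mutex_t lock;
};

/* Called as each trove write op belonging to the request completes. */
static void trove_op_complete(struct req_sync_state *st, int is_last_op)
{
    int do_sync = 0;

    pthread_mutex_lock(&st->lock);
    st->ops_outstanding--;
    if (is_last_op && st->ops_outstanding == 0)
        do_sync = 1;            /* defer the sync to the end of the whole request */
    pthread_mutex_unlock(&st->lock);

    if (do_sync)
        fdatasync(st->fd);      /* one sync per request instead of one per flow buffer */
}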
Re: [Pvfs2-developers] tuning kernel buffer settings
No, these test results predate the 2.6.0 release. We are planning to test with the threaded pvfs2-client later on, though. -Phil Sam Lang wrote: These are great results Phil. Its nice to have you guys doing this testing. Did you get a chance to run any of your tests with the threaded version of pvfs2-client? I added -threaded option, which runs pvfs2-client-core-threaded instead of pvfs2-client-core. For the case where you're running multiple processes concurrently, I wonder if you would see some improvement, although Dean didn't see any when he tried it with one process doing concurrent reads/writes from multiple threads. Just a thought. I'd also be curious what affect the mallocs have on performance. I added a fix to Walt's branch for the allocation of all the lookup segment contexts on every request from the VFS, but that hasn't propagated into HEAD yet. -sam On Nov 29, 2006, at 9:58 AM, Phil Carns wrote: We recently ran some tests that we thought would be interesting to share. We used the following setup: - single client - 16 servers - gigabit ethernet - read/write tests, with 40 GB files - using reads and writes of 100 MB each in size - varying number of processes running concurrently on the client This test application can be configured to be run with multiple processes and/or multiple client nodes. In this case we kept everything on a single client to focus on bottlenecks on that side. What we were looking at was the kernel buffer settings controlled in pint-dev-shared.h. By default PVFS2 uses 5 buffers of 4 MB each. After experimenting for a while, we made a few observations: - increasing the buffer size helped performance - using only 2 buffers (rather than 5) was sufficient to saturate the client when we were running multiple processes; adding more made only a marginal difference We found good results using 2 32MB buffers. Here are some comparisons between the standard settings and the 2 x 32MB configuration: results for RHEL4 (2.6 kernel): -- 5 x 4MB, 1 process: 83.6 MB/s 2 x 32MB, 1 process: 95.5 MB/s 5 x 4MB, 5 processes: 107.4 MB/s 2 x 32MB, 5 processes: 111.2 MB/s results for RHEL3 (2.4 kernel): --- 5 x 4MB, 1 process: 80.5 MB/s 2 x 32MB, 1 process: 90.7 MB/s 5 x 4MB, 5 processes: 91 MB/s 2 x 32MB, 5 processes: 103.5 MB/s A few comments based on those numbers: - on 3 out of 4 tests, we saw a 13-15% performance improvement by going to 2 32 MB buffers - the remaining test (5 process RHEL4) probably did not see as much improvement because we maxed out the network. In the past, netpipe has shown that we can get around 112 MB/s out of these nodes. - the RHEL3 nodes are on a different switch, so it is hard to say how much of the difference from RHEL3 to RHEL4 is due to network topology and how much is due to the kernel version It is also worth noting that even with this tuning, the single process tests are about 14% slower than the 5 process tests. I am guessing that this is due to a lack of pipelining, probably caused by two things: - the application only submitting one read/write at a time - the kernel module itself serializing when it breaks reads/writes into buffer sized chunks The latter could be addressed by either pipelining the I/O through the bufmap interface (so that a single read or write could keep multiple buffers busy) or by going to a system like Murali came up with for memory transfers a while back that isn't limited by buffer size. 
It would also be nice to have a way to set these buffer settings without recompiling - either via module options or via pvfs2-client-core command line options. For the time being we are going to hard code our tree to run with the 32 MB buffers. The 64 MB of RAM that this uses up (vs. 20 MB with the old settings) doesn't really matter for our standard node footprint. -Phil
Re: [Pvfs2-developers] threaded library dependency
On Nov 28, 2006, at 7:59 PM, Phil Carns wrote: $(KERNAPPSTHR): %: %.o $(LIBRARIES_THREADED) committed. Thanks for the report Phil. -sam ___ Pvfs2-developers mailing list Pvfs2-developers@beowulf-underground.org http://www.beowulf-underground.org/mailman/listinfo/pvfs2-developers
Re: [Pvfs2-developers] TroveMethod configuration parameter
On Nov 29, 2006, at 4:20 PM, Phil Carns wrote: Good call. I went ahead and committed both of these changes. Great- thanks! FWIW, I'm not crazy about the redundancy that we have for filesystems in the fs.conf and the collections.db. Would anyone else be in favor of getting rid of the collections.db altogether? I looked at the code, and the only time we ever use that db is when we create or delete a collection, or to verify that an fsid is valid. pvfs2- showcoll uses it to print all of the collections, but for all these cases we could just use the entries in the fs.conf, or look for actual directories in the storage space. The advantage to removing the collections.db would be that trove_initialize wouldn't need a method for itself, independent of the individual collections. Thoughts? That seems reasonable to me. Can this be done without breaking storage space compatibility? It seems like it should- newer servers would probably just ignore the old collections.db if it is still laying around. Yes I think so. -sam -Phil ___ Pvfs2-developers mailing list Pvfs2-developers@beowulf-underground.org http://www.beowulf-underground.org/mailman/listinfo/pvfs2-developers
Re: [Pvfs2-developers] tuning kernel buffer settings
These are great results Phil. It's nice to have you guys doing this testing. Did you get a chance to run any of your tests with the threaded version of pvfs2-client? I added a -threaded option, which runs pvfs2-client-core-threaded instead of pvfs2-client-core. For the case where you're running multiple processes concurrently, I wonder if you would see some improvement, although Dean didn't see any when he tried it with one process doing concurrent reads/writes from multiple threads. Just a thought. I'd also be curious what effect the mallocs have on performance. I added a fix to Walt's branch for the allocation of all the lookup segment contexts on every request from the VFS, but that hasn't propagated into HEAD yet. -sam On Nov 29, 2006, at 9:58 AM, Phil Carns wrote: We recently ran some tests that we thought would be interesting to share. We used the following setup: - single client - 16 servers - gigabit ethernet - read/write tests, with 40 GB files - using reads and writes of 100 MB each in size - varying number of processes running concurrently on the client This test application can be configured to be run with multiple processes and/or multiple client nodes. In this case we kept everything on a single client to focus on bottlenecks on that side. What we were looking at was the kernel buffer settings controlled in pint-dev-shared.h. By default PVFS2 uses 5 buffers of 4 MB each. After experimenting for a while, we made a few observations: - increasing the buffer size helped performance - using only 2 buffers (rather than 5) was sufficient to saturate the client when we were running multiple processes; adding more made only a marginal difference We found good results using 2 32MB buffers. Here are some comparisons between the standard settings and the 2 x 32MB configuration: results for RHEL4 (2.6 kernel): -- 5 x 4MB, 1 process: 83.6 MB/s 2 x 32MB, 1 process: 95.5 MB/s 5 x 4MB, 5 processes: 107.4 MB/s 2 x 32MB, 5 processes: 111.2 MB/s results for RHEL3 (2.4 kernel): --- 5 x 4MB, 1 process: 80.5 MB/s 2 x 32MB, 1 process: 90.7 MB/s 5 x 4MB, 5 processes: 91 MB/s 2 x 32MB, 5 processes: 103.5 MB/s A few comments based on those numbers: - on 3 out of 4 tests, we saw a 13-15% performance improvement by going to 2 32 MB buffers - the remaining test (5 process RHEL4) probably did not see as much improvement because we maxed out the network. In the past, netpipe has shown that we can get around 112 MB/s out of these nodes. - the RHEL3 nodes are on a different switch, so it is hard to say how much of the difference from RHEL3 to RHEL4 is due to network topology and how much is due to the kernel version It is also worth noting that even with this tuning, the single process tests are about 14% slower than the 5 process tests. I am guessing that this is due to a lack of pipelining, probably caused by two things: - the application only submitting one read/write at a time - the kernel module itself serializing when it breaks reads/writes into buffer sized chunks The latter could be addressed by either pipelining the I/O through the bufmap interface (so that a single read or write could keep multiple buffers busy) or by going to a system like Murali came up with for memory transfers a while back that isn't limited by buffer size. It would also be nice to have a way to set these buffer settings without recompiling - either via module options or via pvfs2-client-core command line options. For the time being we are going to hard code our tree to run with the 32 MB buffers.
The 64 MB of RAM that this uses up (vs. 20 MB with the old settings) doesn't really matter for our standard node footprint. -Phil
Re: [Pvfs2-developers] TroveMethod configuration parameter
Good call. I went ahead and committed both of these changes. Great- thanks! FWIW, I'm not crazy about the redundancy that we have for filesystems in the fs.conf and the collections.db. Would anyone else be in favor of getting rid of the collections.db altogether? I looked at the code, and the only time we ever use that db is when we create or delete a collection, or to verify that an fsid is valid. pvfs2- showcoll uses it to print all of the collections, but for all these cases we could just use the entries in the fs.conf, or look for actual directories in the storage space. The advantage to removing the collections.db would be that trove_initialize wouldn't need a method for itself, independent of the individual collections. Thoughts? That seems reasonable to me. Can this be done without breaking storage space compatibility? It seems like it should- newer servers would probably just ignore the old collections.db if it is still laying around. -Phil ___ Pvfs2-developers mailing list Pvfs2-developers@beowulf-underground.org http://www.beowulf-underground.org/mailman/listinfo/pvfs2-developers
Re: [Pvfs2-developers] TroveSyncData settings
That's what I was thinking -- that we could ask the I/O thread to do the syncing rather than stalling out other progress. Wanna try it and see if it helps :)? Rob Phil Carns wrote: No. Both alt aio and the normal dbpf method sync as a seperate step after the aio list operation completes. This is technically possible with alt aio, though- you would just need to pass a flag through to tell the I/O thread to sync after the pwrite(). That would probably be pretty helpful, so the trove worker thread doesn't get stuck waiting on the sync... -Phil Rob Ross wrote: This is similar to using O_DIRECT, which has also shown benefits. With alt aio, do we sync in the context of the I/O thread? Thanks, Rob Phil Carns wrote: One thing that we noticed while testing for storage challenge was that (and everyone correct me if I'm wrong here) enabling the data-sync causes a flush/sync to occur after every sizeof(FlowBuffer) bytes had been written. I can imagine how this would help a SAN, but I'm perplexed how it helps localdisk, what buffer size are you playing with? We found that unless we were using HUGE (~size of cache on storage controller) flowbuffers that this caused way too many syncs/seeks on the disks and hurt performance quite a bit, maybe even as bad as 50% performance because things were not being optimized for our disk subsystems and we were issuing many small ops instead of fewer large ones. Granted I havent been able to get 2.6.0 building properly yet to test the latest out, but this was definitely the case for us on the 2.5 releases. You are definitely right about the data sync option causing a flush/sync on every sizeof(FLowBuffer). I don't really have a good explanation for why this doesn't seem to burn us anymore on local disk. Our settings are standard, except for: - 512KB flow buffer size - alt aio method - 512KB tcp buffers (with larger /proc tcp settings) This testing was done on some version prior to 2.6.0 also (I think it was a merge of some in-between release, so it is hard to pin down a version number). It may also have something to do with the controller and local disks being used? All of our local disk configurations are actually hardware raid 5 with some variety of the megaraid controller, and these are fairly new boxes. -Phil ___ Pvfs2-developers mailing list Pvfs2-developers@beowulf-underground.org http://www.beowulf-underground.org/mailman/listinfo/pvfs2-developers ___ Pvfs2-developers mailing list Pvfs2-developers@beowulf-underground.org http://www.beowulf-underground.org/mailman/listinfo/pvfs2-developers
Re: [Pvfs2-developers] TroveSyncData settings
No. Both alt aio and the normal dbpf method sync as a separate step after the aio list operation completes. This is technically possible with alt aio, though - you would just need to pass a flag through to tell the I/O thread to sync after the pwrite(). That would probably be pretty helpful, so the trove worker thread doesn't get stuck waiting on the sync... -Phil Rob Ross wrote: This is similar to using O_DIRECT, which has also shown benefits. With alt aio, do we sync in the context of the I/O thread? Thanks, Rob Phil Carns wrote: One thing that we noticed while testing for storage challenge was that (and everyone correct me if I'm wrong here) enabling the data-sync causes a flush/sync to occur after every sizeof(FlowBuffer) bytes had been written. I can imagine how this would help a SAN, but I'm perplexed how it helps localdisk, what buffer size are you playing with? We found that unless we were using HUGE (~size of cache on storage controller) flowbuffers that this caused way too many syncs/seeks on the disks and hurt performance quite a bit, maybe even as bad as 50% performance because things were not being optimized for our disk subsystems and we were issuing many small ops instead of fewer large ones. Granted I haven't been able to get 2.6.0 building properly yet to test the latest out, but this was definitely the case for us on the 2.5 releases. You are definitely right about the data sync option causing a flush/sync on every sizeof(FlowBuffer). I don't really have a good explanation for why this doesn't seem to burn us anymore on local disk. Our settings are standard, except for: - 512KB flow buffer size - alt aio method - 512KB tcp buffers (with larger /proc tcp settings) This testing was done on some version prior to 2.6.0 also (I think it was a merge of some in-between release, so it is hard to pin down a version number). It may also have something to do with the controller and local disks being used? All of our local disk configurations are actually hardware raid 5 with some variety of the megaraid controller, and these are fairly new boxes. -Phil
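A minimal pthread-style sketch of the flag-through-to-the-I/O-thread idea (this is not the actual alt-aio code; the struct and function names are invented): the I/O thread issues the pwrite() and, when the operation carries a sync flag, performs the fdatasync() itself, so the trove worker thread never blocks on the sync.

#include <pthread.h>
#include <stdio.h>
#include <string.h>
#include <fcntl.h>
#include <unistd.h>

/* Invented op descriptor standing in for an alt-aio style request. */
struct io_op {
    int fd;
    const void *buf;
    size_t len;
    off_t offset;
    int sync_after;     /* set when data syncing is enabled */
    ssize_t result;
};

/* I/O thread body: do the write, then sync in this thread if asked,
 * so the caller (the trove worker thread) is never stalled on the sync. */
static void *io_thread(void *arg)
{
    struct io_op *op = arg;

    op->result = pwrite(op->fd, op->buf, op->len, op->offset);
    if (op->result >= 0 && op->sync_after) {
        if (fdatasync(op->fd) != 0)
            perror("fdatasync");
    }
    return NULL;
}

int main(void)
{
    const char *msg = "one flow buffer worth of data\n";
    struct io_op op = { .buf = msg, .len = strlen(msg), .offset = 0,
                        .sync_after = 1 };
    pthread_t tid;

    op.fd = open("/tmp/altaio-sync-demo", O_WRONLY | O_CREAT | O_TRUNC, 0600);
    if (op.fd < 0) {
        perror("open");
        return 1;
    }
    pthread_create(&tid, NULL, io_thread, &op);
    pthread_join(tid, NULL);
    printf("pwrite returned %zd\n", op.result);
    close(op.fd);
    return 0;
}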
[Pvfs2-developers] read buffer bug
I ran into a problem today with the 2.6.0 release. This happened to show up in the read04 LTP test, but not reliably. I have attached a test program that I think does trigger it reliably, though. When run on ext3: /home/pcarns> ./testme /tmp/foo.txt read returned: 7, test_buf: hello world When run on pvfs2: /home/pcarns> ./testme /mnt/pvfs2/foo.txt read returned: 7, test_buf: hello (or sometimes you might get garbage after the "hello") The test program creates a string buffer with "goodbye world" stored in it. It then reads the string "hello " out of a file into the beginning of that buffer. The result should be that the final resulting string is "hello world". The trick that makes this fail is asking to read more than 7 bytes from the file. In this particular test program, we attempt to do a read of 255 bytes. There are only 7 bytes in the file, though. The return code from read accurately reflects this. However, rather than just fill in the first 7 bytes of the buffer, it looks like PVFS2 is overwriting the full 255 bytes. What ends up in those trailing 248 bytes is somewhat random. I suspect that somewhere in the kernel module there is a copy_to_user() call that is copying the number of bytes requested by the read rather than the number of bytes returned by the servers. -Phil

#include <stdio.h>
#include <string.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/stat.h>

int main(int argc, char **argv)
{
    int fd = 0;
    int ret = 0;
    char test_string[] = "hello ";
    char test_buf[256];

    if(argc != 2)
    {
        fprintf(stderr, "Usage: %s <filename>\n", argv[0]);
        return(-1);
    }

    fd = open(argv[1], O_RDWR|O_CREAT|O_TRUNC, (S_IRUSR | S_IWUSR));
    if(fd < 0)
    {
        perror("open");
        return(-1);
    }

    ret = write(fd, test_string, strlen(test_string));
    if(ret != strlen(test_string))
    {
        fprintf(stderr, "Error: write failed.\n");
        return(-1);
    }

    /* put some garbage in the buffer so we can detect if something
     * goes wrong */
    strcpy(test_buf, "goodbye world");

    lseek(fd, 0, SEEK_SET);
    ret = read(fd, test_buf, 255);

    printf("read returned: %d, test_buf: %s\n", ret, test_buf);

    close(fd);
    return(0);
}
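For context on the suspected failure mode, here is a purely hypothetical sketch (not the actual pvfs2 kernel module code) of the pattern Phil describes: the copy back to user space has to be clamped to what the servers actually returned, not to the size the application asked for.

/* Hypothetical kernel-module-flavored helper; names and includes are
 * illustrative only. 'requested' is what the application asked read()
 * for; 'returned' is what the servers actually produced. */
#include <linux/uaccess.h>   /* copy_to_user() */
#include <linux/errno.h>

static ssize_t copy_read_result_to_user(char __user *buf, const void *kbuf,
                                        size_t requested, ssize_t returned)
{
    size_t to_copy;

    if (returned < 0)
        return returned;                         /* propagate the error */

    /* the bug described above would be copying 'requested' here */
    to_copy = ((size_t)returned < requested) ? (size_t)returned : requested;

    if (copy_to_user(buf, kbuf, to_copy))
        return -EFAULT;

    return returned;
}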
Re: [Pvfs2-developers] TroveMethod configuration parameter
On Nov 28, 2006, at 12:34 PM, Phil Carns wrote: The main issue here is that the trove initialize doesn't know which method to use, so there needs to be a global default. Each collection also gets its own method as you've noticed, so you can specify that on a per collection basis. Ok - that makes sense. Thanks for the confirmation. The pvfs2-genconfig utility places the TroveMethod option in the Defaults section. Is this intentional? For example, if someone edited a configuration file created by genconfig, they might try to change the value of TroveMethod from "dbpf" to "alt-aio". However, it doesn't look like this would have any real impact. It would change the parameter to trove's initialize() function, but the collection_lookup() would still default to the dbpf method. Hmm.. that's true. The config format and the trove interfaces don't match up real well. Ideally, there wouldn't be a TroveMethod in the Defaults section at all, and the trove initialize would work on a per-collection basis, but right now we store collection info in both the config file and inside the collections database. To be able to open and read from the collections database, we need a trove method. Maybe it makes sense to use the default from the Defaults section if one isn't specified for that collection? I'm not crazy about this solution but it seems like a reasonable alternative. That seems a little more intuitive. Most people editing the config file would probably expect behavior like that (not knowing what the trove stack looks like). Here are a couple of other ideas to throw out there: B) split this into two keywords (hopefully with better names than the examples below): - TroveCollectionMethod: valid _only_ in the StorageHints section - TroveStorageSpaceMethod: valid _only_ in the Defaults section C) keep the existing scheme, but just with two tweaks: - clarify the comments/documentation in server-config.c to indicate that the parameter means something a little different depending on where it is used - change pvfs2-genconfig to emit the "TroveMethod dbpf" line in the StorageHints section rather than the Defaults section. That way someone who comes along later and edits the file would tend to change it in the place that has an impact on server performance. Good call. I went ahead and committed both of these changes. FWIW, I'm not crazy about the redundancy that we have for filesystems in the fs.conf and the collections.db. Would anyone else be in favor of getting rid of the collections.db altogether? I looked at the code, and the only time we ever use that db is when we create or delete a collection, or to verify that an fsid is valid. pvfs2-showcoll uses it to print all of the collections, but for all these cases we could just use the entries in the fs.conf, or look for actual directories in the storage space. The advantage to removing the collections.db would be that trove_initialize wouldn't need a method for itself, independent of the individual collections. Thoughts? -sam For what it's worth, I am going with option C) here for the time being at our site. -Phil
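For readers following along in their config files, the distinction discussed above roughly maps to a fragment like the one below. This is only a sketch of the layout - check the exact section names and option spellings against what pvfs2-genconfig actually emits for your release:

<Defaults>
    # global default; mainly relevant to trove initialization of the storage space
    TroveMethod dbpf
</Defaults>

<FileSystem>
    Name pvfs2-fs
    <StorageHints>
        # per-collection setting; this is the one that affects server I/O behavior
        TroveMethod alt-aio
    </StorageHints>
</FileSystem>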
Re: [Pvfs2-developers] TroveSyncData settings
This is similar to using O_DIRECT, which has also shown benefits. With alt aio, do we sync in the context of the I/O thread? Thanks, Rob Phil Carns wrote: One thing that we noticed while testing for storage challenge was that (and everyone correct me if I'm wrong here) enabling the data-sync causes a flush/sync to occur after every sizeof(FlowBuffer) bytes had been written. I can imagine how this would help a SAN, but I'm perplexed how it helps localdisk, what buffer size are you playing with? We found that unless we were using HUGE (~size of cache on storage controller) flowbuffers that this caused way too many syncs/seeks on the disks and hurt performance quite a bit, maybe even as bad as 50% performance because things were not being optimized for our disk subsystems and we were issuing many small ops instead of fewer large ones. Granted I havent been able to get 2.6.0 building properly yet to test the latest out, but this was definitely the case for us on the 2.5 releases. You are definitely right about the data sync option causing a flush/sync on every sizeof(FLowBuffer). I don't really have a good explanation for why this doesn't seem to burn us anymore on local disk. Our settings are standard, except for: - 512KB flow buffer size - alt aio method - 512KB tcp buffers (with larger /proc tcp settings) This testing was done on some version prior to 2.6.0 also (I think it was a merge of some in-between release, so it is hard to pin down a version number). It may also have something to do with the controller and local disks being used? All of our local disk configurations are actually hardware raid 5 with some variety of the megaraid controller, and these are fairly new boxes. -Phil ___ Pvfs2-developers mailing list Pvfs2-developers@beowulf-underground.org http://www.beowulf-underground.org/mailman/listinfo/pvfs2-developers ___ Pvfs2-developers mailing list Pvfs2-developers@beowulf-underground.org http://www.beowulf-underground.org/mailman/listinfo/pvfs2-developers
Re: [Pvfs2-developers] TroveSyncData settings
One thing that we noticed while testing for storage challenge was that (and everyone correct me if I'm wrong here) enabling the data-sync causes a flush/sync to occur after every sizeof(FlowBuffer) bytes had been written. I can imagine how this would help a SAN, but I'm perplexed how it helps localdisk, what buffer size are you playing with? We found that unless we were using HUGE (~size of cache on storage controller) flowbuffers that this caused way too many syncs/seeks on the disks and hurt performance quite a bit, maybe even as bad as 50% performance because things were not being optimized for our disk subsystems and we were issuing many small ops instead of fewer large ones. Granted I haven't been able to get 2.6.0 building properly yet to test the latest out, but this was definitely the case for us on the 2.5 releases. You are definitely right about the data sync option causing a flush/sync on every sizeof(FlowBuffer). I don't really have a good explanation for why this doesn't seem to burn us anymore on local disk. Our settings are standard, except for: - 512KB flow buffer size - alt aio method - 512KB tcp buffers (with larger /proc tcp settings) This testing was done on some version prior to 2.6.0 also (I think it was a merge of some in-between release, so it is hard to pin down a version number). It may also have something to do with the controller and local disks being used? All of our local disk configurations are actually hardware raid 5 with some variety of the megaraid controller, and these are fairly new boxes. -Phil
Re: [Pvfs2-developers] TroveSyncData settings
Phil Carns wrote: We recently ran some tests trying different sync settings in PVFS2. We ran into one pleasant surprise, although probably it is already obvious to others. Here is the setup: 12 clients 4 servers read/write test application, 100 MB operations, large files fibre channel SAN storage The test application is essentially the same as was used in the posting regarding kernel buffer sizes, although with different parameters in this environment. At any rate, to get to the point: with TroveSyncData=no (default settings): 173 MB/s with TroveSyncData=yes: 194 MB/s I think the issue is that if syncdata is turned off, then the buffer cache tends to get very full before it starts writing. This bursty behavior isn't doing the SAN any favors- it has a big cache on the back end and probably performs better with sustained writes that don't put so much sudden peak traffic on the HBA card. There are probably more sophisticated variations of this kind of tuning around (/proc vm settings, using direct io, etc.) but this is an easy config file change to get an extra 12% throughput. This setting is a little more unpredictable for local scsi disks- some combinations of application and node go faster but some go slower. Overall it seems better for our environment to just leave data syncing on for both SAN and local disk, but your mileage may vary. This is different from results that we have seen in the past (maybe a year ago or so) for local disk- it used to be a big penalty to sync every data operation. I'm not sure what exactly happened to change this (new dbpf design? alt-aio? better kernels?) but I'm not complaining :) One thing that we noticed while testing for storage challenge was that (and everyone correct me if I'm wrong here) enabling the data-sync causes a flush/sync to occur after every sizeof(FlowBuffer) bytes had been written. I can imagine how this would help a SAN, but I'm perplexed how it helps localdisk, what buffer size are you playing with? We found that unless we were using HUGE (~size of cache on storage controller) flowbuffers that this caused way too many syncs/seeks on the disks and hurt performance quite a bit, maybe even as bad as 50% performance because things were not being optimized for our disk subsystems and we were issuing many small ops instead of fewer large ones. Granted I haven't been able to get 2.6.0 building properly yet to test the latest out, but this was definitely the case for us on the 2.5 releases. +=Kyle -Phil -- Kyle Schochenmaier [EMAIL PROTECTED] Research Assistant, Dr. Brett Bode AmesLab - US Dept.Energy Scalable Computing Laboratory
[Pvfs2-developers] TroveSyncData settings
We recently ran some tests trying different sync settings in PVFS2. We ran into one pleasant surprise, although probably it is already obvious to others. Here is the setup: 12 clients 4 servers read/write test application, 100 MB operations, large files fibre channel SAN storage The test application is essentially the same as was used in the posting regarding kernel buffer sizes, although with different parameters in this environment. At any rate, to get to the point: with TroveSyncData=no (default settings): 173 MB/s with TroveSyncData=yes: 194 MB/s I think the issue is that if syncdata is turned off, then the buffer cache tends to get very full before it starts writing. This bursty behavior isn't doing the SAN any favors- it has a big cache on the back end and probably performs better with sustained writes that don't put so much sudden peak traffic on the HBA card. There are probably more sophisticated variations of this kind of tuning around (/proc vm settings, using direct io, etc.) but this is an easy config file change to get an extra 12% throughput. This setting is a little more unpredictable for local scsi disks- some combinations of application and node go faster but some go slower. Overall it seems better for our environment to just leave data syncing on for both SAN and local disk, but your mileage may vary. This is different from results that we have seen in the past (maybe a year ago or so) for local disk- it used to be a big penalty to sync every data operation. I'm not sure what exactly happened to change this (new dbpf design? alt-aio? better kernels?) but I'm not complaining :) -Phil ___ Pvfs2-developers mailing list Pvfs2-developers@beowulf-underground.org http://www.beowulf-underground.org/mailman/listinfo/pvfs2-developers
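For anyone who wants to reproduce the comparison, the knob lives in the per-filesystem StorageHints section of the server config. A sketch of the relevant fragment follows - verify the exact option names against your own generated fs.conf:

<FileSystem>
    Name pvfs2-fs
    <StorageHints>
        TroveSyncMeta yes
        # the setting compared above; 'yes' syncs file data as each flow buffer completes
        TroveSyncData yes
    </StorageHints>
</FileSystem>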
[Pvfs2-developers] tuning kernel buffer settings
We recently ran some tests that we thought would be interesting to share. We used the following setup: - single client - 16 servers - gigabit ethernet - read/write tests, with 40 GB files - using reads and writes of 100 MB each in size - varying number of processes running concurrently on the client This test application can be configured to be run with multiple processes and/or multiple client nodes. In this case we kept everything on a single client to focus on bottlenecks on that side. What we were looking at was the kernel buffer settings controlled in pint-dev-shared.h. By default PVFS2 uses 5 buffers of 4 MB each. After experimenting for a while, we made a few observations: - increasing the buffer size helped performance - using only 2 buffers (rather than 5) was sufficient to saturate the client when we were running multiple processes; adding more made only a marginal difference We found good results using 2 32MB buffers. Here are some comparisons between the standard settings and the 2 x 32MB configuration: results for RHEL4 (2.6 kernel): -- 5 x 4MB, 1 process: 83.6 MB/s 2 x 32MB, 1 process: 95.5 MB/s 5 x 4MB, 5 processes: 107.4 MB/s 2 x 32MB, 5 processes: 111.2 MB/s results for RHEL3 (2.4 kernel): --- 5 x 4MB, 1 process: 80.5 MB/s 2 x 32MB, 1 process: 90.7 MB/s 5 x 4MB, 5 processes: 91 MB/s 2 x 32MB, 5 processes: 103.5 MB/s A few comments based on those numbers: - on 3 out of 4 tests, we saw a 13-15% performance improvement by going to 2 32 MB buffers - the remaining test (5 process RHEL4) probably did not see as much improvement because we maxed out the network. In the past, netpipe has shown that we can get around 112 MB/s out of these nodes. - the RHEL3 nodes are on a different switch, so it is hard to say how much of the difference from RHEL3 to RHEL4 is due to network topology and how much is due to the kernel version It is also worth noting that even with this tuning, the single process tests are about 14% slower than the 5 process tests. I am guessing that this is due to a lack of pipelining, probably caused by two things: - the application only submitting one read/write at a time - the kernel module itself serializing when it breaks reads/writes into buffer sized chunks The latter could be addressed by either pipelining the I/O through the bufmap interface (so that a single read or write could keep multiple buffers busy) or by going to a system like Murali came up with for memory transfers a while back that isn't limited by buffer size. It would also be nice to have a way to set these buffer settings without recompiling- either via module options or via pvfs2-client-core command line options. For the time being we are going to hard code our tree to run with the 32 MB buffers. The 64 MB of RAM that this uses up (vs. 20 MB with the old settings) doesn't really matter for our standard node footprint. -Phil ___ Pvfs2-developers mailing list Pvfs2-developers@beowulf-underground.org http://www.beowulf-underground.org/mailman/listinfo/pvfs2-developers
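To make it concrete where the compile-time tuning happens, the defaults described above boil down to a pair of constants along these lines in pint-dev-shared.h. The macro names here are assumed for illustration - check the header for the real spellings:

/* default kernel bufmap geometry: 5 buffers of 4 MB each (names assumed) */
#define PVFS2_BUFMAP_DESC_COUNT   5
#define PVFS2_BUFMAP_DESC_SIZE    (4 * 1024 * 1024)
#define PVFS2_BUFMAP_TOTAL_SIZE   (PVFS2_BUFMAP_DESC_COUNT * PVFS2_BUFMAP_DESC_SIZE)

/* the 2 x 32 MB configuration from the tests above would instead be:
 *   #define PVFS2_BUFMAP_DESC_COUNT 2
 *   #define PVFS2_BUFMAP_DESC_SIZE  (32 * 1024 * 1024)
 */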