Re: [Pvfs2-developers] tuning kernel buffer settings

2006-11-29 Thread Murali Vilayannur

Hi Phil,
The attached patch fixes the read buffer bug that you mentioned and also
implements variable buffer counts and sizes that can be passed as command
line options to pvfs2-client-core.
I did not implement module-load-time options for the buffer size settings
since that is fairly complicated and not intuitive (having the client core
drive the buffer size and count settings seems to make more sense to me).

So now we can do
pvfs2-client --desc-count= --desc-size=
in addition to the usual options.
With regard to the changes themselves: this involved modifying the
parameters of an existing ioctl, so we break binary compatibility, but I
don't think we have a policy of maintaining backward binary compatibility,
do we?
I have updated the compat ioctl code as well, so hopefully we won't
break in mixed 32/64-bit environments.
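To give a rough idea of the shape of the change, here is a sketch of the
kind of ioctl argument involved; the struct name, field names, and layout
are made up for illustration and are not the actual PVFS2 definitions.

/* Illustrative sketch only, not the real pvfs2 structures.  The point is
 * that pvfs2-client-core now tells the kernel module how many staging
 * buffers to map and how big each one is, instead of both sides relying
 * on compile-time constants. */
#include <stdint.h>

struct hypothetical_bufmap_setup
{
    void    *user_ptr;    /* user-space address of the mapped region    */
    int32_t  total_size;  /* desc_count * desc_size, sanity-checked     */
    int32_t  desc_count;  /* number of staging buffers (default was 5)  */
    int32_t  desc_size;   /* size of each buffer (default was 4 MB)     */
};

/* The pointer field is what makes the compat ioctl path necessary: a
 * 32-bit pvfs2-client-core and a 64-bit kernel disagree on its width, so
 * the 32-bit layout has to be translated before the kernel uses it. */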
I have tested this out with various buffer sizes and counts on 32-bit
platforms only!
That said, I haven't done comprehensive testing, so there may still be bugs.
Please review it and let me know if this looks ok.
BTW: the patch is against pvfs-2.6.0; sorry about that.
CVS ports are firewalled off at work and my internet at home is
temporarily not working.
thanks,
Murali


On 11/29/06, Murali Vilayannur <[EMAIL PROTECTED]> wrote:

Hi Phil,
Thanks for running these tests.
I think this buffer size will be dependent on the machine configuration, right?
If we work out a simple formula for the buffer size based on, say, memory
bandwidth (and/or latency) and network bandwidth (and/or latency), we could
plug that in as a sane default.
I did not realize that this setting would have such a noticeable effect
on performance.
I can work on a patch to change these settings at runtime.
thanks,
Murali

> >> - single client
> >> - 16 servers
> >> - gigabit ethernet
> >> - read/write tests, with 40 GB files
> >> - using reads and writes of 100 MB each in size
> >> - varying number of processes running concurrently on the client
> >>
> >> This test application can be configured to be run with multiple
> >> processes and/or multiple client nodes.  In this case we kept
> >> everything on a single client to focus on bottlenecks on that side.
> >>
> >> What we were looking at was the kernel buffer settings controlled  in
> >> pint-dev-shared.h.  By default PVFS2 uses 5 buffers of 4 MB  each.
> >> After experimenting for a while, we made a few observations:
> >>
> >> - increasing the buffer size helped performance
> >> - using only 2 buffers (rather than 5) was sufficient to saturate  the
> >> client when we were running multiple processes; adding more  made only
> >> a marginal difference
> >>
> >> We found good results using 2 32MB buffers.  Here are some
> >> comparisons between the standard settings and the 2 x 32MB
> >> configuration:
> >>
> >> results for RHEL4 (2.6 kernel):
> >> --
> >> 5 x 4MB, 1 process: 83.6 MB/s
> >> 2 x 32MB, 1 process: 95.5 MB/s
> >>
> >> 5 x 4MB, 5 processes: 107.4 MB/s
> >> 2 x 32MB, 5 processes: 111.2 MB/s
> >>
> >> results for RHEL3 (2.4 kernel):
> >> ---
> >> 5 x 4MB, 1 process: 80.5 MB/s
> >> 2 x 32MB, 1 process: 90.7 MB/s
> >>
> >> 5 x 4MB, 5 processes: 91 MB/s
> >> 2 x 32MB, 5 processes: 103.5 MB/s
> >>
> >>
> >> A few comments based on those numbers:
> >>
> >> - on 3 out of 4 tests, we saw a 13-15% performance improvement by
> >> going to 2 32 MB buffers
> >> - the remaining test (5 process RHEL4) probably did not see as much
> >> improvement because we maxed out the network.  In the past, netpipe
> >> has shown that we can get around 112 MB/s out of these nodes.
> >> - the RHEL3 nodes are on a different switch, so it is hard to say  how
> >> much of the difference from RHEL3 to RHEL4 is due to network  topology
> >> and how much is due to the kernel version
> >>
> >> It is also worth noting that even with this tuning, the single
> >> process tests are about 14% slower than the 5 process tests.  I am
> >> guessing that this is due to a lack of pipelining, probably caused  by
> >> two things:
> >> - the application only submitting one read/write at a time
> >> - the kernel module itself serializing when it breaks reads/writes
> >> into buffer sized chunks
> >>
> >> The latter could be addressed by either pipelining the I/O through
> >> the bufmap interface (so that a single read or write could keep
> >> multiple buffers busy) or by going to a system like Murali came up
> >> with for memory transfers a while back that isn't limited by buffer
> >> size.
> >>
> >> It would also be nice to have a way to set these buffer settings
> >> without recompiling- either via module options or via pvfs2-client-
> >> core command line options.  For the time being we are going to hard
> >> code our tree to run with the 32 MB buffers.  The 64 MB of RAM that
> >> this uses up (vs. 20 MB with the old settings) doesn't really  matter
> >> for our standard node footprint.
> >>
> >> -Phil

Re: [Pvfs2-developers] tuning kernel buffer settings

2006-11-29 Thread Murali Vilayannur

Hi Phil,
Thanks for running these tests.
I think this buffer size will be dependent on the machine configuration, right?
If we work out a simple formula for the buffer size based on, say, memory
bandwidth (and/or latency) and network bandwidth (and/or latency), we could
plug that in as a sane default.
I did not realize that this setting would have such a noticeable effect
on performance.
I can work on a patch to change these settings at runtime.
thanks,
Murali


>> - single client
>> - 16 servers
>> - gigabit ethernet
>> - read/write tests, with 40 GB files
>> - using reads and writes of 100 MB each in size
>> - varying number of processes running concurrently on the client
>>
>> This test application can be configured to be run with multiple
>> processes and/or multiple client nodes.  In this case we kept
>> everything on a single client to focus on bottlenecks on that side.
>>
>> What we were looking at was the kernel buffer settings controlled  in
>> pint-dev-shared.h.  By default PVFS2 uses 5 buffers of 4 MB  each.
>> After experimenting for a while, we made a few observations:
>>
>> - increasing the buffer size helped performance
>> - using only 2 buffers (rather than 5) was sufficient to saturate  the
>> client when we were running multiple processes; adding more  made only
>> a marginal difference
>>
>> We found good results using 2 32MB buffers.  Here are some
>> comparisons between the standard settings and the 2 x 32MB
>> configuration:
>>
>> results for RHEL4 (2.6 kernel):
>> --
>> 5 x 4MB, 1 process: 83.6 MB/s
>> 2 x 32MB, 1 process: 95.5 MB/s
>>
>> 5 x 4MB, 5 processes: 107.4 MB/s
>> 2 x 32MB, 5 processes: 111.2 MB/s
>>
>> results for RHEL3 (2.4 kernel):
>> ---
>> 5 x 4MB, 1 process: 80.5 MB/s
>> 2 x 32MB, 1 process: 90.7 MB/s
>>
>> 5 x 4MB, 5 processes: 91 MB/s
>> 2 x 32MB, 5 processes: 103.5 MB/s
>>
>>
>> A few comments based on those numbers:
>>
>> - on 3 out of 4 tests, we saw a 13-15% performance improvement by
>> going to 2 32 MB buffers
>> - the remaining test (5 process RHEL4) probably did not see as much
>> improvement because we maxed out the network.  In the past, netpipe
>> has shown that we can get around 112 MB/s out of these nodes.
>> - the RHEL3 nodes are on a different switch, so it is hard to say  how
>> much of the difference from RHEL3 to RHEL4 is due to network  topology
>> and how much is due to the kernel version
>>
>> It is also worth noting that even with this tuning, the single
>> process tests are about 14% slower than the 5 process tests.  I am
>> guessing that this is due to a lack of pipelining, probably caused  by
>> two things:
>> - the application only submitting one read/write at a time
>> - the kernel module itself serializing when it breaks reads/writes
>> into buffer sized chunks
>>
>> The latter could be addressed by either pipelining the I/O through
>> the bufmap interface (so that a single read or write could keep
>> multiple buffers busy) or by going to a system like Murali came up
>> with for memory transfers a while back that isn't limited by buffer
>> size.
>>
>> It would also be nice to have a way to set these buffer settings
>> without recompiling- either via module options or via pvfs2-client-
>> core command line options.  For the time being we are going to hard
>> code our tree to run with the 32 MB buffers.  The 64 MB of RAM that
>> this uses up (vs. 20 MB with the old settings) doesn't really  matter
>> for our standard node footprint.
>>
>> -Phil
>> ___
>> Pvfs2-developers mailing list
>> Pvfs2-developers@beowulf-underground.org
>> http://www.beowulf-underground.org/mailman/listinfo/pvfs2-developers
>>
>

___
Pvfs2-developers mailing list
Pvfs2-developers@beowulf-underground.org
http://www.beowulf-underground.org/mailman/listinfo/pvfs2-developers




Re: [Pvfs2-developers] read buffer bug

2006-11-29 Thread Murali Vilayannur

Hi guys,
I am really sorry about this. I am surprised we did not catch this
earlier. This was basically introduced by the file.c/bufmap.c cleanups
that I had done a while back.
Attached patch should fix this error.
thanks for the testcase, Phil!
Murali

On 11/29/06, Phil Carns <[EMAIL PROTECTED]> wrote:

I ran into a problem today with the 2.6.0 release.  This happened to
show up in the read04 LTP test, but not reliably.  I have attached a
test program that I think does trigger it reliably, though.

When run on ext3:

/home/pcarns> ./testme /tmp/foo.txt
read returned: 7, test_buf: hello   world

When run on pvfs2:

/home/pcarns> ./testme /mnt/pvfs2/foo.txt
read returned: 7, test_buf: hello

(or sometimes you might get garbage after the "hello")

The test program creates a string buffer with "goodbye world" stored in
it.  It then reads the string "hello  " out of a file into the beginning
of that buffer.   The result should be that the final resulting string
is "hello  world".

The trick that makes this fail is asking to read more than 7 bytes from
the file.

In this particular test program, we attempt to do a read of 255 bytes.
There are only 7 bytes in the file, though.  The return code from read
accurately reflects this.  However, rather than just fill in the first 7
bytes of the buffer, it looks like PVFS2 is overwriting the full 255
bytes.  What ends up in those trailing 248 bytes is somewhat random.

I suspect that somewhere in the kernel module there is a copy_to_user()
call that is copying the number of bytes requested by the read rather
than the number of bytes returned by the servers.

-Phil


___
Pvfs2-developers mailing list
Pvfs2-developers@beowulf-underground.org
http://www.beowulf-underground.org/mailman/listinfo/pvfs2-developers






___
Pvfs2-developers mailing list
Pvfs2-developers@beowulf-underground.org
http://www.beowulf-underground.org/mailman/listinfo/pvfs2-developers


Re: [Pvfs2-developers] TroveSyncData settings

2006-11-29 Thread Sam Lang


On Nov 29, 2006, at 3:44 PM, Rob Ross wrote:

That's what I was thinking -- that we could ask the I/O thread to  
do the syncing rather than stalling out other progress.


Wanna try it and see if it helps :)?

Rob

Phil Carns wrote:
No.  Both alt aio and the normal dbpf method sync as a separate
step after the aio list operation completes.
This is technically possible with alt aio, though- you would just  
need to pass a flag through to tell the I/O thread to sync after  
the pwrite().  That would probably be pretty helpful, so the trove  
worker thread doesn't get stuck waiting on the sync...

-Phil
Rob Ross wrote:

This is similar to using O_DIRECT, which has also shown benefits.

With alt aio, do we sync in the context of the I/O thread?

Thanks,

Rob

Phil Carns wrote:



One thing that we noticed while testing for storage challenge  
was that (and everyone correct me if I'm wrong here) enabling  
the data-sync causes a flush/sync to occur after every sizeof 
(FlowBuffer) bytes had been written.  I can imagine how this  
would help a SAN, but I'm perplexed how it helps localdisk,  
what buffer size are you playing with?
We found that unless we were using HUGE (~size of cache on  
storage controller) flowbuffers that this caused way too many  
syncs/seeks on the disks and hurt performance quite a bit,  
maybe even as bad as 50% performance because things were not  
being optimized for our disk subsystems and we were issuing  
many small ops instead of fewer large ones.


Granted, I haven't been able to get 2.6.0 building properly yet
to test the latest out, but this was definitely the case for us
on the 2.5 releases.



You are definitely right about the data sync option causing a  
flush/sync on every sizeof(FlowBuffer).


I had a note that we should change the default aio data-sync code to  
only sync at the end of an IO request, instead of for each trove  
operation (in FlowBufferSize chunks).  Doing this at the end of an  
io.sm seemed a little messy, but if/when we have request ids (hints)  
being passed to the trove interface, we could use that as a way to  
know to flush at the end.  In any case, it sounds like it's better to  
flush early and often than at the end of a request?


From a user perspective, we usually tell people to enable data sync  
if they're concerned about losing data.  Now we're talking about  
getting better performance with data sync enabled (at least in some  
cases).  Does it make sense to sync even with data sync disabled if  
we can figure out that better performance would result?


-sam

  I don't really have a good explanation for why this doesn't  
seem to burn us anymore on local disk.  Our settings are  
standard, except for:


- 512KB flow buffer size
- alt aio method
- 512KB tcp buffers (with larger /proc tcp settings)

This testing was done on some version prior to 2.6.0 also (I  
think it was a merge of some in-between release, so it is hard  
to pin down a version number).


It may also have something to do with the controller and local  
disks being used?  All of our local disk configurations are  
actually hardware raid 5 with some variety of the megaraid  
controller, and these are fairly new boxes.


-Phil

___
Pvfs2-developers mailing list
Pvfs2-developers@beowulf-underground.org
http://www.beowulf-underground.org/mailman/listinfo/pvfs2-developers



___
Pvfs2-developers mailing list
Pvfs2-developers@beowulf-underground.org
http://www.beowulf-underground.org/mailman/listinfo/pvfs2-developers



___
Pvfs2-developers mailing list
Pvfs2-developers@beowulf-underground.org
http://www.beowulf-underground.org/mailman/listinfo/pvfs2-developers


Re: [Pvfs2-developers] tuning kernel buffer settings

2006-11-29 Thread Phil Carns
No, these test results predate the 2.6.0 release.  We are planning to 
test with the threaded pvfs2-client later on, though.


-Phil

Sam Lang wrote:


These are great results, Phil.  It's nice to have you guys doing this  
testing.  Did you get a chance to run any of your tests with the  
threaded version of pvfs2-client?  I added a -threaded option, which  runs 
pvfs2-client-core-threaded instead of pvfs2-client-core.  For  the case 
where you're running multiple processes concurrently, I  wonder if you 
would see some improvement, although Dean didn't see  any when he tried 
it with one process doing concurrent reads/writes  from multiple 
threads.  Just a thought.


I'd also be curious what effect the mallocs have on performance.  I  
added a fix to Walt's branch for the allocation of all the lookup  
segment contexts on every request from the VFS, but that hasn't  
propagated into HEAD yet.


-sam

On Nov 29, 2006, at 9:58 AM, Phil Carns wrote:

We recently ran some tests that we thought would be interesting to  
share.  We used the following setup:


- single client
- 16 servers
- gigabit ethernet
- read/write tests, with 40 GB files
- using reads and writes of 100 MB each in size
- varying number of processes running concurrently on the client

This test application can be configured to be run with multiple  
processes and/or multiple client nodes.  In this case we kept  
everything on a single client to focus on bottlenecks on that side.


What we were looking at was the kernel buffer settings controlled  in 
pint-dev-shared.h.  By default PVFS2 uses 5 buffers of 4 MB  each.  
After experimenting for a while, we made a few observations:


- increasing the buffer size helped performance
- using only 2 buffers (rather than 5) was sufficient to saturate  the 
client when we were running multiple processes; adding more  made only 
a marginal difference


We found good results using 2 32MB buffers.  Here are some  
comparisons between the standard settings and the 2 x 32MB  
configuration:


results for RHEL4 (2.6 kernel):
--
5 x 4MB, 1 process: 83.6 MB/s
2 x 32MB, 1 process: 95.5 MB/s

5 x 4MB, 5 processes: 107.4 MB/s
2 x 32MB, 5 processes: 111.2 MB/s

results for RHEL3 (2.4 kernel):
---
5 x 4MB, 1 process: 80.5 MB/s
2 x 32MB, 1 process: 90.7 MB/s

5 x 4MB, 5 processes: 91 MB/s
2 x 32MB, 5 processes: 103.5 MB/s


A few comments based on those numbers:

- on 3 out of 4 tests, we saw a 13-15% performance improvement by  
going to 2 32 MB buffers
- the remaining test (5 process RHEL4) probably did not see as much  
improvement because we maxed out the network.  In the past, netpipe  
has shown that we can get around 112 MB/s out of these nodes.
- the RHEL3 nodes are on a different switch, so it is hard to say  how 
much of the difference from RHEL3 to RHEL4 is due to network  topology 
and how much is due to the kernel version


It is also worth noting that even with this tuning, the single  
process tests are about 14% slower than the 5 process tests.  I am  
guessing that this is due to a lack of pipelining, probably caused  by 
two things:

- the application only submitting one read/write at a time
- the kernel module itself serializing when it breaks reads/writes  
into buffer sized chunks


The latter could be addressed by either pipelining the I/O through  
the bufmap interface (so that a single read or write could keep  
multiple buffers busy) or by going to a system like Murali came up  
with for memory transfers a while back that isn't limited by buffer  
size.


It would also be nice to have a way to set these buffer settings  
without recompiling- either via module options or via pvfs2-client- 
core command line options.  For the time being we are going to hard  
code our tree to run with the 32 MB buffers.  The 64 MB of RAM that  
this uses up (vs. 20 MB with the old settings) doesn't really  matter 
for our standard node footprint.


-Phil
___
Pvfs2-developers mailing list
Pvfs2-developers@beowulf-underground.org
http://www.beowulf-underground.org/mailman/listinfo/pvfs2-developers





___
Pvfs2-developers mailing list
Pvfs2-developers@beowulf-underground.org
http://www.beowulf-underground.org/mailman/listinfo/pvfs2-developers


Re: [Pvfs2-developers] threaded library dependency

2006-11-29 Thread Sam Lang


On Nov 28, 2006, at 7:59 PM, Phil Carns wrote:


$(KERNAPPSTHR): %: %.o $(LIBRARIES_THREADED)


committed.  Thanks for the report Phil.

-sam


___
Pvfs2-developers mailing list
Pvfs2-developers@beowulf-underground.org
http://www.beowulf-underground.org/mailman/listinfo/pvfs2-developers


Re: [Pvfs2-developers] TroveMethod configuration parameter

2006-11-29 Thread Sam Lang


On Nov 29, 2006, at 4:20 PM, Phil Carns wrote:




Good call.  I went ahead and committed both of these changes.


Great- thanks!

FWIW, I'm not crazy about the redundancy that we have for  
filesystems  in the fs.conf and the collections.db.  Would anyone  
else be in favor  of getting rid of the collections.db  
altogether?  I looked at the  code, and the only time we ever use  
that db is when we create or  delete a collection, or to verify  
that an fsid is valid.  pvfs2-showcoll uses it to print all of  
the collections, but for all these  cases we could just use the  
entries in the fs.conf, or look for  actual directories in the  
storage space.  The advantage to removing  the collections.db  
would be that trove_initialize wouldn't need a  method for itself,  
independent of the individual collections.  Thoughts?


That seems reasonable to me.  Can this be done without breaking  
storage space compatibility?  It seems like it should- newer  
servers would probably just ignore the old collections.db if it is  
still laying around.



Yes I think so.

-sam


-Phil



___
Pvfs2-developers mailing list
Pvfs2-developers@beowulf-underground.org
http://www.beowulf-underground.org/mailman/listinfo/pvfs2-developers


Re: [Pvfs2-developers] tuning kernel buffer settings

2006-11-29 Thread Sam Lang


These are great results, Phil.  It's nice to have you guys doing this  
testing.  Did you get a chance to run any of your tests with the  
threaded version of pvfs2-client?  I added a -threaded option, which  
runs pvfs2-client-core-threaded instead of pvfs2-client-core.  For  
the case where you're running multiple processes concurrently, I  
wonder if you would see some improvement, although Dean didn't see  
any when he tried it with one process doing concurrent reads/writes  
from multiple threads.  Just a thought.


I'd also be curious what effect the mallocs have on performance.  I  
added a fix to Walt's branch for the allocation of all the lookup  
segment contexts on every request from the VFS, but that hasn't  
propagated into HEAD yet.


-sam

On Nov 29, 2006, at 9:58 AM, Phil Carns wrote:

We recently ran some tests that we thought would be interesting to  
share.  We used the following setup:


- single client
- 16 servers
- gigabit ethernet
- read/write tests, with 40 GB files
- using reads and writes of 100 MB each in size
- varying number of processes running concurrently on the client

This test application can be configured to be run with multiple  
processes and/or multiple client nodes.  In this case we kept  
everything on a single client to focus on bottlenecks on that side.


What we were looking at was the kernel buffer settings controlled  
in pint-dev-shared.h.  By default PVFS2 uses 5 buffers of 4 MB  
each.  After experimenting for a while, we made a few observations:


- increasing the buffer size helped performance
- using only 2 buffers (rather than 5) was sufficient to saturate  
the client when we were running multiple processes; adding more  
made only a marginal difference


We found good results using 2 32MB buffers.  Here are some  
comparisons between the standard settings and the 2 x 32MB  
configuration:


results for RHEL4 (2.6 kernel):
--
5 x 4MB, 1 process: 83.6 MB/s
2 x 32MB, 1 process: 95.5 MB/s

5 x 4MB, 5 processes: 107.4 MB/s
2 x 32MB, 5 processes: 111.2 MB/s

results for RHEL3 (2.4 kernel):
---
5 x 4MB, 1 process: 80.5 MB/s
2 x 32MB, 1 process: 90.7 MB/s

5 x 4MB, 5 processes: 91 MB/s
2 x 32MB, 5 processes: 103.5 MB/s


A few comments based on those numbers:

- on 3 out of 4 tests, we saw a 13-15% performance improvement by  
going to 2 32 MB buffers
- the remaining test (5 process RHEL4) probably did not see as much  
improvement because we maxed out the network.  In the past, netpipe  
has shown that we can get around 112 MB/s out of these nodes.
- the RHEL3 nodes are on a different switch, so it is hard to say  
how much of the difference from RHEL3 to RHEL4 is due to network  
topology and how much is due to the kernel version


It is also worth noting that even with this tuning, the single  
process tests are about 14% slower than the 5 process tests.  I am  
guessing that this is due to a lack of pipelining, probably caused  
by two things:

- the application only submitting one read/write at a time
- the kernel module itself serializing when it breaks reads/writes  
into buffer sized chunks


The latter could be addressed by either pipelining the I/O through  
the bufmap interface (so that a single read or write could keep  
multiple buffers busy) or by going to a system like Murali came up  
with for memory transfers a while back that isn't limited by buffer  
size.


It would also be nice to have a way to set these buffer settings  
without recompiling- either via module options or via pvfs2-client- 
core command line options.  For the time being we are going to hard  
code our tree to run with the 32 MB buffers.  The 64 MB of RAM that  
this uses up (vs. 20 MB with the old settings) doesn't really  
matter for our standard node footprint.


-Phil
___
Pvfs2-developers mailing list
Pvfs2-developers@beowulf-underground.org
http://www.beowulf-underground.org/mailman/listinfo/pvfs2-developers



___
Pvfs2-developers mailing list
Pvfs2-developers@beowulf-underground.org
http://www.beowulf-underground.org/mailman/listinfo/pvfs2-developers


Re: [Pvfs2-developers] TroveMethod configuration parameter

2006-11-29 Thread Phil Carns



Good call.  I went ahead and committed both of these changes.


Great- thanks!

FWIW, I'm not crazy about the redundancy that we have for filesystems  
in the fs.conf and the collections.db.  Would anyone else be in favor  
of getting rid of the collections.db altogether?  I looked at the  code, 
and the only time we ever use that db is when we create or  delete a 
collection, or to verify that an fsid is valid.  pvfs2-showcoll uses it 
to print all of the collections, but for all these  cases we could just 
use the entries in the fs.conf, or look for  actual directories in the 
storage space.  The advantage to removing  the collections.db would be 
that trove_initialize wouldn't need a  method for itself, independent of 
the individual collections.  Thoughts?


That seems reasonable to me.  Can this be done without breaking storage 
space compatibility?  It seems like it should- newer servers would 
probably just ignore the old collections.db if it is still laying around.


-Phil
___
Pvfs2-developers mailing list
Pvfs2-developers@beowulf-underground.org
http://www.beowulf-underground.org/mailman/listinfo/pvfs2-developers


Re: [Pvfs2-developers] TroveSyncData settings

2006-11-29 Thread Rob Ross
That's what I was thinking -- that we could ask the I/O thread to do the 
syncing rather than stalling out other progress.


Wanna try it and see if it helps :)?

Rob

Phil Carns wrote:
No.  Both alt aio and the normal dbpf method sync as a separate step 
after the aio list operation completes.


This is technically possible with alt aio, though- you would just need 
to pass a flag through to tell the I/O thread to sync after the 
pwrite().  That would probably be pretty helpful, so the trove worker 
thread doesn't get stuck waiting on the sync...


-Phil


Rob Ross wrote:

This is similar to using O_DIRECT, which has also shown benefits.

With alt aio, do we sync in the context of the I/O thread?

Thanks,

Rob

Phil Carns wrote:



One thing that we noticed while testing for storage challenge was 
that (and everyone correct me if I'm wrong here) enabling the 
data-sync causes a flush/sync to occur after every 
sizeof(FlowBuffer) bytes had been written.  I can imagine how this 
would help a SAN, but I'm perplexed how it helps localdisk, what 
buffer size are you playing with?
We found that unless we were using HUGE (~size of cache on storage 
controller) flowbuffers that this caused way too many syncs/seeks on 
the disks and hurt performance quite a bit, maybe even as bad as 50% 
performance because things were not being optimized for our disk 
subsystems and we were issuing many small ops instead of fewer large 
ones.


Granted, I haven't been able to get 2.6.0 building properly yet to 
test the latest out, but this was definitely the case for us on the 
2.5 releases.



You are definitely right about the data sync option causing a 
flush/sync on every sizeof(FlowBuffer).  I don't really have a good 
explanation for why this doesn't seem to burn us anymore on local 
disk.  Our settings are standard, except for:


- 512KB flow buffer size
- alt aio method
- 512KB tcp buffers (with larger /proc tcp settings)

This testing was done on some version prior to 2.6.0 also (I think it 
was a merge of some in-between release, so it is hard to pin down a 
version number).


It may also have something to do with the controller and local disks 
being used?  All of our local disk configurations are actually 
hardware raid 5 with some variety of the megaraid controller, and 
these are fairly new boxes.


-Phil

___
Pvfs2-developers mailing list
Pvfs2-developers@beowulf-underground.org
http://www.beowulf-underground.org/mailman/listinfo/pvfs2-developers




___
Pvfs2-developers mailing list
Pvfs2-developers@beowulf-underground.org
http://www.beowulf-underground.org/mailman/listinfo/pvfs2-developers


Re: [Pvfs2-developers] TroveSyncData settings

2006-11-29 Thread Phil Carns
No.  Both alt aio and the normal dbpf method sync as a separate step 
after the aio list operation completes.


This is technically possible with alt aio, though- you would just need 
to pass a flag through to tell the I/O thread to sync after the 
pwrite().  That would probably be pretty helpful, so the trove worker 
thread doesn't get stuck waiting on the sync...
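
Roughly, the idea would look something like the sketch below; the struct,
field, and function names are made up for illustration and are not the
actual alt-aio code.

#include <sys/types.h>
#include <unistd.h>
#include <errno.h>

/* Hypothetical work item handed to the alt-aio I/O thread. */
struct io_work
{
    int     fd;
    void   *buf;
    size_t  count;
    off_t   offset;
    int     sync_after_write;   /* set when data syncing is enabled */
};

/* Runs in the I/O thread, so the trove worker thread never blocks on the
 * sync; it only ever sees a completed (written and synced) operation. */
static ssize_t do_write_op(struct io_work *w)
{
    ssize_t ret = pwrite(w->fd, w->buf, w->count, w->offset);
    if (ret < 0)
        return -errno;

    if (w->sync_after_write && fdatasync(w->fd) < 0)
        return -errno;

    return ret;
}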


-Phil


Rob Ross wrote:

This is similar to using O_DIRECT, which has also shown benefits.

With alt aio, do we sync in the context of the I/O thread?

Thanks,

Rob

Phil Carns wrote:



One thing that we noticed while testing for storage challenge was 
that (and everyone correct me if I'm wrong here) enabling the 
data-sync causes a flush/sync to occur after every sizeof(FlowBuffer) 
bytes had been written.  I can imagine how this would help a SAN, but 
I'm perplexed how it helps localdisk, what buffer size are you 
playing with?
We found that unless we were using HUGE (~size of cache on storage 
controller) flowbuffers that this caused way too many syncs/seeks on 
the disks and hurt performance quite a bit, maybe even as bad as 50% 
performance because things were not being optimized for our disk 
subsystems and we were issuing many small ops instead of fewer large 
ones.


Granted, I haven't been able to get 2.6.0 building properly yet to test 
the latest out, but this was definitely the case for us on the 2.5 
releases.



You are definitely right about the data sync option causing a 
flush/sync on every sizeof(FlowBuffer).  I don't really have a good 
explanation for why this doesn't seem to burn us anymore on local 
disk.  Our settings are standard, except for:


- 512KB flow buffer size
- alt aio method
- 512KB tcp buffers (with larger /proc tcp settings)

This testing was done on some version prior to 2.6.0 also (I think it 
was a merge of some in-between release, so it is hard to pin down a 
version number).


It may also have something to do with the controller and local disks 
being used?  All of our local disk configurations are actually 
hardware raid 5 with some variety of the megaraid controller, and 
these are fairly new boxes.


-Phil

___
Pvfs2-developers mailing list
Pvfs2-developers@beowulf-underground.org
http://www.beowulf-underground.org/mailman/listinfo/pvfs2-developers



___
Pvfs2-developers mailing list
Pvfs2-developers@beowulf-underground.org
http://www.beowulf-underground.org/mailman/listinfo/pvfs2-developers


[Pvfs2-developers] read buffer bug

2006-11-29 Thread Phil Carns
I ran into a problem today with the 2.6.0 release.  This happened to 
show up in the read04 LTP test, but not reliably.  I have attached a 
test program that I think does trigger it reliably, though.


When run on ext3:

/home/pcarns> ./testme /tmp/foo.txt
read returned: 7, test_buf: hello   world

When run on pvfs2:

/home/pcarns> ./testme /mnt/pvfs2/foo.txt
read returned: 7, test_buf: hello

(or sometimes you might get garbage after the "hello")

The test program creates a string buffer with "goodbye world" stored in 
it.  It then reads the string "hello  " out of a file into the beginning 
of that buffer.   The result should be that the final resulting string 
is "hello  world".


The trick that makes this fail is asking to read more than 7 bytes from 
the file.


In this particular test program, we attempt to do a read of 255 bytes. 
There are only 7 bytes in the file, though.  The return code from read 
accurately reflects this.  However, rather than just fill in the first 7 
bytes of the buffer, it looks like PVFS2 is overwriting the full 255 
bytes.  What ends up in those trailing 248 bytes is somewhat random.


I suspect that somewhere in the kernel module there is a copy_to_user() 
call that is copying the number of bytes requested by the read rather 
than the number of bytes returned by the servers.
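
If that is the case, the fix would look something like the simplified
sketch below; the names are made up and this is not the actual pvfs2
kernel module code.

#include <linux/uaccess.h>   /* copy_to_user (asm/uaccess.h on older kernels) */
#include <linux/errno.h>

/* Copy a completed read back to the application.  'requested' is what the
 * application asked for; 'returned' is what the servers actually sent. */
static ssize_t copy_read_to_user(char __user *user_buf, const char *kern_buf,
                                 size_t requested, size_t returned)
{
    /* The buggy pattern would pass 'requested' here, spraying stale bytes
     * past the end of the valid data; copy only what actually came back. */
    size_t copy_len = (returned < requested) ? returned : requested;

    if (copy_to_user(user_buf, kern_buf, copy_len))
        return -EFAULT;

    return copy_len;
}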


-Phil
#include <stdio.h>
#include <string.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/stat.h>

int main(int argc, char **argv)	 
{
   int fd = 0;
   int ret = 0;
   char test_string[] = "hello  ";
   char test_buf[256];

   if(argc != 2)
   {
  fprintf(stderr, "Usage: %s \n", argv[0]);
  return(-1);
   }

   fd = open(argv[1], O_RDWR|O_CREAT|O_TRUNC, (S_IRUSR | S_IWUSR));
   if(fd < 0)
   {
  perror("open");
  return(-1);
   }

   ret = write(fd, test_string, strlen(test_string));
   if(ret != strlen(test_string))
   {
  fprintf(stderr, "Error: write failed.\n");
  return(-1);
   }

   /* put some garbage in the buffer so we can detect if something goes wrong */
   strcpy(test_buf, "goodbye world");
   
   lseek(fd, 0, SEEK_SET);
   ret = read(fd, test_buf, 255);
   
   printf("read returned: %d, test_buf: %s\n", ret, test_buf);

   close(fd);

   return(0);
}
___
Pvfs2-developers mailing list
Pvfs2-developers@beowulf-underground.org
http://www.beowulf-underground.org/mailman/listinfo/pvfs2-developers


Re: [Pvfs2-developers] TroveMethod configuration parameter

2006-11-29 Thread Sam Lang


On Nov 28, 2006, at 12:34 PM, Phil Carns wrote:



The main issue here is that the trove initialize doesn't know  
which  method to use, so there needs to be a global default.   
Each  collection also gets its own method as you've noticed, so  
you can  specify that on a per collection basis.


Ok- that makes sense.  Thanks for the confirmation.

The pvfs2-genconfig utility places the TroveMethod option in the Defaults 
section.  Is this intentional?  For example, if someone edited a 
configuration file created by genconfig, they might try to change the 
value of TroveMethod from "dbpf" to "alt-aio".  However, it doesn't look 
like this would have any real impact.  It would change the parameter to 
trove's initialize() function, but the collection_lookup() would still 
default to the dbpf method.
Hmm.. that's true.  The config format and the trove interfaces don't 
match up real well.  Ideally, there wouldn't be a TroveMethod in the 
Defaults section at all, and the trove initialize would work on a 
per-collection basis, but right now we store collection info in both 
the config file and inside the collections database.  To be able to 
open and read from the collections database, we need a trove method.
Maybe it makes sense to use the default from the Defaults section if one 
isn't specified for that collection?  I'm not crazy about this solution 
but it seems like a reasonable alternative.


That seems a little more intuitive.  Most people editing the config  
file would probably expect behavior like that (not knowing what the  
trove stack looks like).  Here are a couple of other ideas to throw  
out there:


B) split this into two keywords (hopefully with better names than  
the examples below):


- TroveCollectionMethod: valid _only_ in the StorageHints section
- TroveStorageSpaceMethod: valid _only_ in the Defaults section

C) keep the existing scheme, but just with two tweaks:

- clarify the comments/documentation in server-config.c to indicate  
that the parameter means something a little different depending on  
where it is used
- change pvfs2-genconfig to emit the "TroveMethod dbpf" line in the  
StorageHints section rather than the Defaults section.  That way  
someone  who comes along later and edits the file would tend to  
change it in the place that has an impact on server performance.


Good call.  I went ahead and committed both of these changes.

FWIW, I'm not crazy about the redundancy that we have for filesystems  
in the fs.conf and the collections.db.  Would anyone else be in favor  
of getting rid of the collections.db altogether?  I looked at the  
code, and the only time we ever use that db is when we create or  
delete a collection, or to verify that an fsid is valid.  pvfs2-showcoll 
uses it to print all of the collections, but for all these  
cases we could just use the entries in the fs.conf, or look for  
actual directories in the storage space.  The advantage to removing  
the collections.db would be that trove_initialize wouldn't need a  
method for itself, independent of the individual collections.  Thoughts?
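
As a rough illustration of the "look for actual directories" idea (just a
sketch; the real storage space layout and naming scheme are not shown
here):

#include <sys/stat.h>
#include <stdio.h>

/* Sketch only: decide whether a collection exists by checking for its
 * directory under the storage space instead of consulting collections.db.
 * The path format here is hypothetical. */
static int collection_dir_exists(const char *storage_path,
                                 const char *coll_dirname)
{
    char path[4096];
    struct stat sb;

    snprintf(path, sizeof(path), "%s/%s", storage_path, coll_dirname);
    return (stat(path, &sb) == 0) && S_ISDIR(sb.st_mode);
}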


-sam



For what it's worth, I am going with option C) here for the time being at  
our site.


-Phil



___
Pvfs2-developers mailing list
Pvfs2-developers@beowulf-underground.org
http://www.beowulf-underground.org/mailman/listinfo/pvfs2-developers


Re: [Pvfs2-developers] TroveSyncData settings

2006-11-29 Thread Rob Ross

This is similar to using O_DIRECT, which has also shown benefits.

With alt aio, do we sync in the context of the I/O thread?

Thanks,

Rob

Phil Carns wrote:


One thing that we noticed while testing for storage challenge was that 
(and everyone correct me if I'm wrong here) enabling the data-sync 
causes a flush/sync to occur after every sizeof(FlowBuffer) bytes had 
been written.  I can imagine how this would help a SAN, but I'm 
perplexed how it helps localdisk, what buffer size are you playing with?
We found that unless we were using HUGE (~size of cache on storage 
controller) flowbuffers that this caused way too many syncs/seeks on 
the disks and hurt performance quite a bit, maybe even as bad as 50% 
performance because things were not being optimized for our disk 
subsystems and we were issuing many small ops instead of fewer large 
ones.


Granted, I haven't been able to get 2.6.0 building properly yet to test 
the latest out, but this was definitely the case for us on the 2.5 
releases.


You are definitely right about the data sync option causing a flush/sync 
on every sizeof(FlowBuffer).  I don't really have a good explanation for 
why this doesn't seem to burn us anymore on local disk.  Our settings 
are standard, except for:


- 512KB flow buffer size
- alt aio method
- 512KB tcp buffers (with larger /proc tcp settings)

This testing was done on some version prior to 2.6.0 also (I think it 
was a merge of some in-between release, so it is hard to pin down a 
version number).


It may also have something to do with the controller and local disks 
being used?  All of our local disk configurations are actually hardware 
raid 5 with some variety of the megaraid controller, and these are 
fairly new boxes.


-Phil

___
Pvfs2-developers mailing list
Pvfs2-developers@beowulf-underground.org
http://www.beowulf-underground.org/mailman/listinfo/pvfs2-developers


___
Pvfs2-developers mailing list
Pvfs2-developers@beowulf-underground.org
http://www.beowulf-underground.org/mailman/listinfo/pvfs2-developers


Re: [Pvfs2-developers] TroveSyncData settings

2006-11-29 Thread Phil Carns


One thing that we noticed while testing for storage challenge was that 
(and everyone correct me if I'm wrong here) enabling the data-sync 
causes a flush/sync to occur after every sizeof(FlowBuffer) bytes had 
been written.  I can imagine how this would help a SAN, but I'm 
perplexed how it helps localdisk, what buffer size are you playing with?
We found that unless we were using HUGE (~size of cache on storage 
controller) flowbuffers that this caused way too many syncs/seeks on the 
disks and hurt performance quite a bit, maybe even as bad as 50% 
performance because things were not being optimized for our disk 
subsystems and we were issuing many small ops instead of fewer large ones.


Granted, I haven't been able to get 2.6.0 building properly yet to test 
the latest out, but this was definitely the case for us on the 2.5 
releases.


You are definitely right about the data sync option causing a flush/sync 
on every sizeof(FlowBuffer).  I don't really have a good explanation for 
why this doesn't seem to burn us anymore on local disk.  Our settings 
are standard, except for:


- 512KB flow buffer size
- alt aio method
- 512KB tcp buffers (with larger /proc tcp settings)

This testing was done on some version prior to 2.6.0 also (I think it 
was a merge of some in-between release, so it is hard to pin down a 
version number).


It may also have something to do with the controller and local disks 
being used?  All of our local disk configurations are actually hardware 
raid 5 with some variety of the megaraid controller, and these are 
fairly new boxes.


-Phil

___
Pvfs2-developers mailing list
Pvfs2-developers@beowulf-underground.org
http://www.beowulf-underground.org/mailman/listinfo/pvfs2-developers


Re: [Pvfs2-developers] TroveSyncData settings

2006-11-29 Thread Kyle Schochenmaier

Phil Carns wrote:
We recently ran some tests trying different sync settings in PVFS2.  
We ran into one pleasant surprise, although probably it is already 
obvious to others.  Here is the setup:


12 clients
4 servers
read/write test application, 100 MB operations, large files
fibre channel SAN storage

The test application is essentially the same as was used in the 
posting regarding kernel buffer sizes, although with different 
parameters in this environment.


At any rate, to get to the point:

with TroveSyncData=no (default settings): 173 MB/s
with TroveSyncData=yes: 194 MB/s

I think the issue is that if syncdata is turned off, then the buffer 
cache tends to get very full before it starts writing.  This bursty 
behavior isn't doing the SAN any favors- it has a big cache on the 
back end and probably performs better with sustained writes that don't 
put so much sudden peak traffic on the HBA card.


There are probably more sophisticated variations of this kind of 
tuning around (/proc vm settings, using direct io, etc.) but this is 
an easy config file change to get an extra 12% throughput.


This setting is a little more unpredictable for local scsi disks- some 
combinations of application and node go faster but some go slower. 
Overall it seems better for our environment to just leave data syncing 
on for both SAN and local disk, but your mileage may vary.


This is different from results that we have seen in the past (maybe a 
year ago or so) for local disk- it used to be a big penalty to sync 
every data operation.  I'm not sure what exactly happened to change 
this (new dbpf design?  alt-aio?  better kernels?) but I'm not 
complaining :)


One thing that we noticed while testing for storage challenge was that 
(and everyone correct me if I'm wrong here) enabling the data-sync 
causes a flush/sync to occur after every sizeof(FlowBuffer) bytes had 
been written.  I can imagine how this would help a SAN, but I'm 
perplexed how it helps localdisk, what buffer size are you playing with? 

We found that unless we were using HUGE (~size of the cache on the storage 
controller) flowbuffers, this caused way too many syncs/seeks on the 
disks and hurt performance quite a bit, maybe even as bad as a 50% 
performance loss, because things were not being optimized for our disk 
subsystems and we were issuing many small ops instead of fewer large ones.


Granted, I haven't been able to get 2.6.0 building properly yet to test 
the latest out, but this was definitely the case for us on the 2.5 releases.


+=Kyle

-Phil
___
Pvfs2-developers mailing list
Pvfs2-developers@beowulf-underground.org
http://www.beowulf-underground.org/mailman/listinfo/pvfs2-developers





--
Kyle Schochenmaier
[EMAIL PROTECTED]
Research Assistant, Dr. Brett Bode
AmesLab - US Dept.Energy
Scalable Computing Laboratory 


___
Pvfs2-developers mailing list
Pvfs2-developers@beowulf-underground.org
http://www.beowulf-underground.org/mailman/listinfo/pvfs2-developers


[Pvfs2-developers] TroveSyncData settings

2006-11-29 Thread Phil Carns
We recently ran some tests trying different sync settings in PVFS2.  We 
ran into one pleasant surprise, although probably it is already obvious 
to others.  Here is the setup:


12 clients
4 servers
read/write test application, 100 MB operations, large files
fibre channel SAN storage

The test application is essentially the same as was used in the posting 
regarding kernel buffer sizes, although with different parameters in 
this environment.


At any rate, to get to the point:

with TroveSyncData=no (default settings): 173 MB/s
with TroveSyncData=yes: 194 MB/s

I think the issue is that if syncdata is turned off, then the buffer 
cache tends to get very full before it starts writing.  This bursty 
behavior isn't doing the SAN any favors- it has a big cache on the back 
end and probably performs better with sustained writes that don't put so 
much sudden peak traffic on the HBA card.


There are probably more sophisticated variations of this kind of tuning 
around (/proc vm settings, using direct io, etc.) but this is an easy 
config file change to get an extra 12% throughput.


This setting is a little more unpredictable for local scsi disks- some 
combinations of application and node go faster but some go slower. 
Overall it seems better for our environment to just leave data syncing 
on for both SAN and local disk, but your mileage may vary.


This is different from results that we have seen in the past (maybe a 
year ago or so) for local disk- it used to be a big penalty to sync 
every data operation.  I'm not sure what exactly happened to change this 
(new dbpf design?  alt-aio?  better kernels?) but I'm not complaining :)


-Phil
___
Pvfs2-developers mailing list
Pvfs2-developers@beowulf-underground.org
http://www.beowulf-underground.org/mailman/listinfo/pvfs2-developers


[Pvfs2-developers] tuning kernel buffer settings

2006-11-29 Thread Phil Carns
We recently ran some tests that we thought would be interesting to 
share.  We used the following setup:


- single client
- 16 servers
- gigabit ethernet
- read/write tests, with 40 GB files
- using reads and writes of 100 MB each in size
- varying number of processes running concurrently on the client

This test application can be configured to be run with multiple 
processes and/or multiple client nodes.  In this case we kept everything 
on a single client to focus on bottlenecks on that side.


What we were looking at was the kernel buffer settings controlled in 
pint-dev-shared.h.  By default PVFS2 uses 5 buffers of 4 MB each.  After 
experimenting for a while, we made a few observations:


- increasing the buffer size helped performance
- using only 2 buffers (rather than 5) was sufficient to saturate the 
client when we were running multiple processes; adding more made only a 
marginal difference


We found good results using 2 32MB buffers.  Here are some comparisons 
between the standard settings and the 2 x 32MB configuration:


results for RHEL4 (2.6 kernel):
--
5 x 4MB, 1 process: 83.6 MB/s
2 x 32MB, 1 process: 95.5 MB/s

5 x 4MB, 5 processes: 107.4 MB/s
2 x 32MB, 5 processes: 111.2 MB/s

results for RHEL3 (2.4 kernel):
---
5 x 4MB, 1 process: 80.5 MB/s
2 x 32MB, 1 process: 90.7 MB/s

5 x 4MB, 5 processes: 91 MB/s
2 x 32MB, 5 processes: 103.5 MB/s


A few comments based on those numbers:

- on 3 out of 4 tests, we saw a 13-15% performance improvement by going 
to 2 32 MB buffers
- the remaining test (5 process RHEL4) probably did not see as much 
improvement because we maxed out the network.  In the past, netpipe has 
shown that we can get around 112 MB/s out of these nodes.
- the RHEL3 nodes are on a different switch, so it is hard to say how 
much of the difference from RHEL3 to RHEL4 is due to network topology 
and how much is due to the kernel version


It is also worth noting that even with this tuning, the single process 
tests are about 14% slower than the 5 process tests.  I am guessing that 
this is due to a lack of pipelining, probably caused by two things:

- the application only submitting one read/write at a time
- the kernel module itself serializing when it breaks reads/writes into 
buffer sized chunks


The latter could be addressed by either pipelining the I/O through the 
bufmap interface (so that a single read or write could keep multiple 
buffers busy) or by going to a system like Murali came up with for 
memory transfers a while back that isn't limited by buffer size.


It would also be nice to have a way to set these buffer settings without 
recompiling- either via module options or via pvfs2-client-core command 
line options.  For the time being we are going to hard code our tree to 
run with the 32 MB buffers.  The 64 MB of RAM that this uses up (vs. 20 
MB with the old settings) doesn't really matter for our standard node 
footprint.
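
For reference, the hard coding amounts to editing the compile-time
constants in pint-dev-shared.h; the macro names below are illustrative
(the exact names in the header may differ), but the values show the
before and after.

/* Defaults being replaced: 5 buffers x 4 MB = 20 MB of pinned memory. */
#define PVFS2_BUFMAP_DESC_COUNT   5
#define PVFS2_BUFMAP_DESC_SIZE    (4 * 1024 * 1024)

/* Tuned configuration from the tests above: 2 buffers x 32 MB = 64 MB. */
/*
#define PVFS2_BUFMAP_DESC_COUNT   2
#define PVFS2_BUFMAP_DESC_SIZE    (32 * 1024 * 1024)
*/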


-Phil
___
Pvfs2-developers mailing list
Pvfs2-developers@beowulf-underground.org
http://www.beowulf-underground.org/mailman/listinfo/pvfs2-developers