Re: [Pvfs2-developers] BMI questions

2006-12-01 Thread Sam Lang


On Nov 30, 2006, at 6:58 PM, Scott Atchley wrote:


On Nov 30, 2006, at 4:31 PM, Sam Lang wrote:

Right now all our operations (or transactions, as you call them)  
start with an unexpected message from the client, and end with an  
expected message from the server.  I don't know if that's a design  
requirement of BMI though, or just an artifact of how we use it in  
PVFS.  I _think_ the BMI interfaces were meant to allow expected  
messages in either direction in any order, and it's left up to the  
upper layers to make sure they get posted correctly, but again, I  
would have to defer to one of the BMI sages.


Hmmm. I assumed that for any operation there would be a back and  
forth between client and server, ending with an expected send from  
the server to the client:


Client                       Server
   |           unex             |
   |--------------------------->|
   |                            |
   |            ex              |
   |<---------------------------|
   |                            |
   |            ex              |
   |--------------------------->|
   |                            |
   |            ex              |
   |<---------------------------|
   |                            |

with a minimum of an unexpected message from client to server  
followed by an expected message from server to client. If this is  
the case, I might be able to do simple flow control on the client  
using a reference count (increment on send to server S and decrement  
on receive from S).
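
A minimal sketch of that reference-count idea in C (the structure and
names are hypothetical, and whether the limit should match the server's
flow buffer count is just a guess):

    /* Hypothetical per-peer credit counter for send-side flow control.
     * Increment when an expected send is posted to server S, decrement
     * when an expected receive from S completes; refuse (or queue) new
     * sends once the counter hits a configured limit. */
    #include <pthread.h>

    struct peer_credits {
        pthread_mutex_t lock;
        int outstanding;   /* sends posted but not yet answered */
        int limit;         /* e.g. 8, matching the server's flow buffers */
    };

    /* Returns 0 if the send may be posted now, -1 if it should be queued. */
    static int credits_acquire(struct peer_credits *pc)
    {
        int ret = 0;
        pthread_mutex_lock(&pc->lock);
        if (pc->outstanding < pc->limit)
            pc->outstanding++;
        else
            ret = -1;
        pthread_mutex_unlock(&pc->lock);
        return ret;
    }

    /* Called when a matching expected receive from the server completes. */
    static void credits_release(struct peer_credits *pc)
    {
        pthread_mutex_lock(&pc->lock);
        if (pc->outstanding > 0)
            pc->outstanding--;
        pthread_mutex_unlock(&pc->lock);
    }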


Are you saying that a single operation may not ping pong back and  
forth but have multiple expected sends in a single direction?


Client                       Server
   |           unex             |
   |--------------------------->|
   |                            |
   |            ex              |
   |<---------------------------|
   |                            |
   |            ex              |
   |--------------------------->|
   |                            |
   |            ex              |
   |--------------------------->|
   |                            |
   |            ex              |
   |--------------------------->|
   |                            |
   |            ex              |
   |<---------------------------|
   |                            |

If so, would each of the receives (and matching sends) use  
different tags? Also, this case presents a resource starvation  
risk. Since the BMI method does not know about the entire operation  
(how many sends/receives), it is possible that it could start the  
operation but not be able to get the additional resources for the  
subsequent sends/receives to complete it.


Your example above is currently how writes work.  The client sends an  
unexpected message to the server (a control message for the IO: file  
info, size of the IO, etc.), which posts an expected receive, and  
then sends an expected message back to the client.  The client posts  
a receive for that expected message before sending the unexpected  
one.  After the receive of the expected message at the client  
completes (this is a 'ready for IO' message from the server), it  
posts a send of the actual IO data (this will be up to  
FlowBufferSize).  Once that send completes, it posts another one, and  
assumes that the server has already posted another receive (based on  
the size of the entire IO).  Once all the IO has completed at the  
server (including pushing the data to disk), the server sends a  
response ack message, which the client posted a receive for before  
doing any of the actual IO.


I think the ordering of posts goes something like this for a write:

client:                           server:

                                  post_unexp
post_recv(ready_ack)
post_send(IO_request)
                                  wait(IO_request)
                                  post_recv(IO1)
                                  post_send(ready_ack)
wait(ready_ack)
post_send(IO1)
post_recv(write_ack)
                                  wait(IO1)
                                  post_recv(IO2)
wait_for_send_completion(IO1)
post_send(IO2)
                                  wait(IO2)
                                  post_recv(IO3)
...                               ...
post_send(ION)
                                  wait(ION)
                                  post_send(write_ack)
wait(write_ack)
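
In rough C, the client side of that ordering might be driven like the
following (the post_* and wait_* names are stand-ins for the real
job/BMI calls, not the actual flow-protocol code):

    #include <stddef.h>

    enum msg { IO_REQUEST, READY_ACK, WRITE_ACK };

    /* Stand-in declarations for the real posting/completion interfaces. */
    void post_send(enum msg m, int tag);
    void post_recv(enum msg m, int tag);
    void post_send_data(const char *buf, size_t len, int tag);
    void wait_for_completion(enum msg m);
    void wait_for_send_completion(void);

    static int client_write_flow(int tag, const char *buf, size_t total,
                                 size_t flow_buffer_size)
    {
        size_t offset = 0;

        post_recv(READY_ACK, tag);      /* posted before the request goes out */
        post_send(IO_REQUEST, tag);     /* the unexpected control message */
        wait_for_completion(READY_ACK); /* server says it is ready for IO */

        post_recv(WRITE_ACK, tag);      /* final ack, posted before any IO */

        while (offset < total) {
            size_t chunk = total - offset;
            if (chunk > flow_buffer_size)
                chunk = flow_buffer_size;
            post_send_data(buf + offset, chunk, tag);
            wait_for_send_completion(); /* one data send in flight at a time */
            offset += chunk;
        }

        wait_for_completion(WRITE_ACK); /* covers the data reaching disk */
        return 0;
    }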


It looks like the flow code on the server doesn't actually post the  
next recv of IO (IO2) until the first recv has completed (IO1), so  
it's possible that the client posts (and starts) the next send before  
the server posts the next receive, although it's probably unlikely.   
The server posts the next recv (IO2) once the first recv completes,  
as well as posting the associated trove write.

Re: [Pvfs2-developers] BMI questions

2006-12-01 Thread Scott Atchley

On Dec 1, 2006, at 4:33 AM, Sam Lang wrote:

Your example above is currently how writes work.  The client sends  
an unexpected message to the server (a control message for the IO:  
file info, size of the IO, etc.), which posts an expected receive,  
and then sends an expected message back to the client.  The client  
posts a receive for that expected message before sending the  
unexpected one.  After the receive of the expected message at the  
client completes (this is a 'ready for IO' message from the server),  
it posts a send of the actual IO data (this will be up to  
FlowBufferSize).  Once that send completes, it posts another one,  
and assumes that the server has already posted another receive  
(based on the size of the entire IO).  Once all the IO has completed  
at the server (including pushing the data to disk), the server sends  
a response ack message, which the client posted a receive for before  
doing any of the actual IO.


Ok.

It looks like the flow code on the server doesn't actually post the  
next recv of IO (IO2) until the first recv has completed (IO1), so  
it's possible that the client posts (and starts) the next send  
before the server posts the next receive, although it's probably  
unlikely.


If IO operations are always > 32 KB, I would agree. But if any are <=  
32 KB, MX will buffer them on the send side and complete immediately.  
The client could then post another even if MX is in the middle of  
delivering the first one. I can override this behavior (use  
mx_issend()) or use credits for flow control.



Each BMI receive uses a separate buffer (up to a max of 8 buffers).


Does this mean that at most, the client will post 8 IO sends per  
operation?


Every time a bmi recv completes, two things happen: the associated  
trove write is posted, and a new bmi recv is posted.  So over time,  
bmi receives will get posted at the server before bmi sends get  
posted at the client, but the second and maybe third bmi receives  
may be posted after the bmi sends at the client.


To answer your specific questions:

The same bmi tag is passed to each of the post_send and post_recv  
calls for the entire IO operation.


I can live with this as long as only one receive is posted at a time  
using a specific tag.


As to hitting resource limits, the client doesn't post the next  
send until the previous send has completed.  I think with enough IO  
operations from different clients happening concurrently, it may be  
possible to run into the resource issues you speak of, but I need  
to verify that.


Definitely.

Yes, it always posts a receive for an expected message.  For most  
expected messages the receive is guaranteed to be posted before the  
peer posts the send.  That doesn't appear to be guaranteed in the IO  
case though, as I mentioned above.


Hope this helps.

-sam


Tremendously. In one of the diagrams above, you seem to indicate that  
the server will post receives for unexpected messages. Is this the  
case? If so, does it simply use BMI_method_post_recv()? With what  
tag, etc.?


From the IB code, it looks like the server does not post a receive  
for unexpected messages, but relies on the BMI method to receive the  
message and put it in a queue, and then return it when  
BMI_method_test_unexpected() is called. Am I reading this wrong?


Scott


[Pvfs2-developers] About state machines

2006-12-01 Thread Kwangho CHA
Dear All,
 
I'm trying to modify PVFS for my small experiment, but I'm not familiar with the 
state machine code (.sm files). 
 
Of course, I have read the 'Writing State Machines' part of the 'add-server-req' 
file, but I'd like to know: is there any more detailed documentation about the 
state machine code?
 
Thanks in advance.
 
Kwangho CHA.




Re: [Pvfs2-developers] About state machines

2006-12-01 Thread Julian Martin Kunkel
Hi,
you can find an excerpt of the documentation I have written (and am still 
writing) for PVFS2 here:
http://www.rzuser.uni-heidelberg.de/~jkunkel2/pvfs2-doc.pdf

Although it is not finished yet and contains some typos, it documents parts of 
the interaction between a system call and the server state machine invocation...

Julian


Re: [Pvfs2-developers] bufmap_copy_iovec errors?

2006-12-01 Thread Sam Lang


On Nov 30, 2006, at 6:11 PM, Murali Vilayannur wrote:


Hi Sam,
Looks good.
It fixes the I/O errors on ppc64, right?
Check it in! Also check in the things that I had sent you earlier
(changing ssize_t * to a size_t *) if you deem necessary.


I changed them all to longs as you suggested.  The fixes have been  
committed to CVS.


-sam


thanks!
Murali

On 11/30/06, Sam Lang <[EMAIL PROTECTED]> wrote:



woops.  Sorry about that.

-sam




On Nov 30, 2006, at 5:52 PM, Sam Lang wrote:

>
> Hi Murali,
>
> I think you're on the right track; it looks like it's a casting
> problem from int to long, but the bug appears to be in the nr_segs
> parameter passed to wait_for_io and then copy_iovec_from_user.  The
> attached patch fixes the error, as well as changes all the unsigned
> long variable definitions for the segs of iovecs to unsigned int.
> It seems safe to assume that we're never going to have more than 4
> billion or so segments, but it's not clear to me why some of them
> were unsigned long and some were unsigned int.  Can you have a look
> at the patch and let me know what you think?
>
> Thanks,
>
> -sam
>
> On Nov 30, 2006, at 1:04 PM, Murali Vilayannur wrote:
>
>> Hi guys,
>> drat..there have been so many bugs in the bufmap.c code lately.
>> This must be some data type overflow or something..
>> Can you try the attached patch and see if it helps..
>> As an aside,
>> how do we make gcc complain if types don't match perfectly?
>> I am surprised that a size_t * and a ssize_t * pass type checks even
>> with -Wall..
>>
>> Kyle, if it does not work,
>> can you give me access to your machine? I can take a look at this
>> tonight.
>> If that is not possible, can you uncomment this line in the kernel
>> makefile
>> #EXTRA_CFLAGS += -DPVFS2_KERNEL_DEBUG
>> rebuild the module, rerun everything and send me the logs.
>> thanks,
>> Murali
>>
>>
>> On 11/30/06, Sam Lang <[EMAIL PROTECTED]> wrote:
>>>
>>> Hi Kyle,
>>>
>>> I don't have a fix for your problem yet, but I think the message
>>> about "Please make sure that the pvfs2-client is running" is
>>> erroneous.  The real error is the pvfs_bufmap_copy_iovec_from_user
>>> error.
>>>
>>> Also, did you pull from CVS using the pvfs-2-6-0 release tag?  If
>>> not, the code in trunk (HEAD tag) may not be working for ppc64 at
>>> this point.
>>>
>>> -sam
>>>
>>>
>>> On Nov 30, 2006, at 10:49 AM, Kyle Schochenmaier wrote:
>>>
>>> > I was able to get the client finally built and mounted this
>>> > morning for 2.6.0-cvs, and ran across this problem whenever
>>> > trying to write/read through the vfs to the mount:
>>> > *I'm running a biarch debian-ppc64 setup on the client, which
>>> > has worked in the past on 2.5-cvs.
>>> >
>>> > pvfs2_bufmap_copy_iovec_from_user: computed total (0) is not
>>> > equal to (2862872)
>>> > /usr/src/pvfs-2.6.0-cvs/src/kernel/linux-2.6/file.c line 216:
>>> > Failed to copy-in buffers. Please make sure that the
>>> > pvfs2-client is running. -22
>>> >
>>> > The client doesn't crash, and is indeed still there, mounted;
>>> > `ls` verifies the filesystem is still up.
>>> >
>>> > I'm able to do regular operations just fine directly to the
>>> > filesystem via libpvfs2; however, nothing works on the vfs
>>> > mount.  This error message isn't very helpful to me as the
>>> > client is still running.  What should I look for to debug this?
>>> >
>>> > +=Kyle
>>> >
>>> > --
>>> > Kyle Schochenmaier
>>> > [EMAIL PROTECTED]
>>> > Research Assistant, Dr. Brett Bode
>>> > AmesLab - US Dept.Energy
>>> > Scalable Computing Laboratory


Re: [Pvfs2-developers] read buffer bug

2006-12-01 Thread Sam Lang


Its in CVS now.  Thanks Murali for the fix, and Phil for the report.

-sam

On Nov 30, 2006, at 10:15 AM, Phil Carns wrote:


Thanks for the quick fix!

Murali Vilayannur wrote:

Hi guys,
I am really sorry about this. I am surprised we did not catch this
earlier. This was basically introduced by the file.c/bufmap.c cleanups
that I had done a while back.
The attached patch should fix this error.
Thanks for the test case, Phil!
Murali
On 11/29/06, Phil Carns <[EMAIL PROTECTED]> wrote:

I ran into a problem today with the 2.6.0 release.  This happened to
show up in the read04 LTP test, but not reliably.  I have attached a
test program that I think does trigger it reliably, though.

When run on ext3:

/home/pcarns> ./testme /tmp/foo.txt
read returned: 7, test_buf: hello   world

When run on pvfs2:

/home/pcarns> ./testme /mnt/pvfs2/foo.txt
read returned: 7, test_buf: hello

(or sometimes you might get garbage after the "hello")

The test program creates a string buffer with "goodbye world" stored in
it.  It then reads the string "hello  " out of a file into the beginning
of that buffer.  The result should be that the final resulting string
is "hello  world".

The trick that makes this fail is asking to read more than 7 bytes from
the file.

In this particular test program, we attempt to do a read of 255 bytes.
There are only 7 bytes in the file, though.  The return code from read
accurately reflects this.  However, rather than just fill in the first 7
bytes of the buffer, it looks like PVFS2 is overwriting the full 255
bytes.  What ends up in those trailing 248 bytes is somewhat random.
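
A minimal reconstruction of the test might look like this (my own
sketch, not the attached program; it assumes the file already holds the
7 bytes "hello  "):

    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/types.h>
    #include <unistd.h>

    int main(int argc, char **argv)
    {
        char test_buf[256];
        int fd;
        ssize_t ret;

        if (argc != 2) {
            fprintf(stderr, "usage: %s <file>\n", argv[0]);
            return 1;
        }

        /* Pre-fill the buffer; the tail must survive a short read. */
        strcpy(test_buf, "goodbye world");

        fd = open(argv[1], O_RDONLY);
        if (fd < 0) {
            perror("open");
            return 1;
        }

        /* Ask for far more than the 7 bytes the file holds; only the
         * returned byte count should be written into test_buf. */
        ret = read(fd, test_buf, 255);
        printf("read returned: %zd, test_buf: %s\n", ret, test_buf);

        close(fd);
        return 0;
    }

If the kernel module copies the full 255 requested bytes, the NUL
terminator left over from "goodbye world" gets clobbered and the printf
shows garbage after "hello", which matches the behavior above.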

I suspect that somewhere in the kernel module there is a copy_to_user()
call that is copying the number of bytes requested by the read rather
than the number of bytes returned by the servers.

-Phil




Re: [Pvfs2-developers] BMI questions

2006-12-01 Thread Sam Lang


On Dec 1, 2006, at 7:10 AM, Scott Atchley wrote:


On Dec 1, 2006, at 4:33 AM, Sam Lang wrote:

Your example above is currently how writes work.  The client sends  
an unexpected message to the server (a control message for the IO:  
file info, size of the IO, etc.), which posts an expected receive,  
and then sends an expected message back to the client.  The client  
posts a receive for that expected message before sending the  
unexpected one.  After the receive of the expected message at the  
client completes (this is a 'ready for IO' message from the server),  
it posts a send of the actual IO data (this will be up to  
FlowBufferSize).  Once that send completes, it posts another one,  
and assumes that the server has already posted another receive  
(based on the size of the entire IO).  Once all the IO has completed  
at the server (including pushing the data to disk), the server sends  
a response ack message, which the client posted a receive for before  
doing any of the actual IO.


Ok.

It looks like the flow code on the server doesn't actually post  
the next recv of IO (IO2) until the first recv has completed  
(IO1), so it's possible that the client posts (and starts) the next  
send before the server posts the next receive, although it's  
probably unlikely.


If IO operations are always > 32 KB, I would agree. But if any are  
<= 32 KB, MX will buffer them on the send side and complete  
immediately. The client could then post another even if MX is in  
the middle of delivering the first one. I can override this  
behavior (use mx_issend()) or use credits for flow control.


Hm...these particular IOs are going to post BMI_send calls > 32KB.   
If the IO is less than that, we probably want to pack the IO in the  
first request.  We call that eager mode, and you would need to have  
the BMI_get_info(BMI_GET_UNEXP_SIZE) return 32K.
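
Roughly, that decision looks like the following sketch (illustrative
only; query_unexp_size() stands in for the BMI_get_info(BMI_GET_UNEXP_SIZE)
query mentioned above, and the mode names are made up):

    #include <stddef.h>

    enum io_mode { IO_MODE_EAGER, IO_MODE_FLOW };

    /* Stand-in for the BMI_get_info(addr, BMI_GET_UNEXP_SIZE, ...) query;
     * the real call and its types should be taken from bmi.h. */
    size_t query_unexp_size(void);

    static enum io_mode choose_io_mode(size_t io_size)
    {
        size_t max_unexp = query_unexp_size(); /* e.g. 32K for the MX method */

        if (io_size <= max_unexp)
            return IO_MODE_EAGER;  /* data rides along in the unexpected msg */
        return IO_MODE_FLOW;       /* normal request / ready-ack / data flow */
    }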


In either case it sounds like it's possible for a bunch of client  
sends to get posted, and a bunch of server receives to get posted,  
without any of them actually completing.  Is it possible to sort all  
that out if the same tag is specified for all of them?






Each BMI receive uses a separate buffer (up to a max of 8 buffers).


Does this mean that at most, the client will post 8 IO sends per  
operation?


The 8 buffer limit is specified by the FlowBuffersPerFlow config  
option, and it just limits the number of buffers that can be  
allocated on the server (and hence the number of outstanding BMI  
operations for a particular IO).  In the diagram I sent in the  
previous email, each IOn would have had an associated buffer.  When  
it gets to 8, no more BMI_post_recv calls are made until one of the  
TROVE_post_write calls has completed (freeing up one of the  
buffers).  None of that changes the behavior on the client, since the  
client uses the user buffer.  It keeps posting another send once a  
previous send has completed.
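
A rough sketch of that server-side buffer cycling (illustrative names
only, not the actual flow-protocol code; the limit is the 8 discussed
above):

    #define FLOW_BUFFERS_PER_FLOW 8

    /* Stand-ins for the real BMI/Trove posting calls. */
    void post_bmi_recv(int chunk);
    void post_trove_write(int chunk);

    struct flow_state {
        int buffers_in_use;  /* buffers with a posted recv or pending write */
        int next_recv;       /* index of the next IO chunk to receive */
        int total_chunks;
    };

    /* Called once to prime the flow, then again on every completion. */
    static void post_more_recvs(struct flow_state *f)
    {
        while (f->buffers_in_use < FLOW_BUFFERS_PER_FLOW &&
               f->next_recv < f->total_chunks) {
            post_bmi_recv(f->next_recv++);
            f->buffers_in_use++;
        }
    }

    static void on_bmi_recv_complete(struct flow_state *f, int chunk)
    {
        post_trove_write(chunk); /* buffer stays busy until the write ends */
        post_more_recvs(f);      /* post another recv if a buffer is free */
    }

    static void on_trove_write_complete(struct flow_state *f)
    {
        f->buffers_in_use--;     /* buffer is free again */
        post_more_recvs(f);
    }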




Every time a bmi recv completes, two things happen: the associated  
trove write is posted, and a new bmi recv is posted.  So over  
time, bmi receives will get posted at the server before bmi sends  
get posted at the client, but the second and maybe third bmi  
receives may be posted after the bmi sends at the client.


To answer your specific questions:

The same bmi tag is passed to each of the post_send and post_recv  
calls for the entire IO operation.


I can live with this as long as only one receive is posted at a  
time using a specific tag.


Hm..we actually do post multiple receives using the same tag.  All  
BMI messages for a given IO operation get the same tag.




As to hitting resource limits, the client doesn't post the next  
send until the previous send has completed.  I think with enough  
IO operations from different clients happening concurrently, it  
may be possible to run into the resource issues you speak of, but  
I need to verify that.


Definitely.

Yes, it always posts a receive for an expected message.  For most  
expected messages the receive is guaranteed to be posted before  
the peer posts the send.  That doesn't appear to be guaranteed in  
the IO case though, as I mentioned above.


Hope this helps.

-sam


Tremendously. In one of the diagrams above, you seem to indicate  
that the server will post receives for unexpected messages. Is this  
the case? If so, does it simply use BMI_method_post_recv()? With  
what tag, etc.?




From the IB code, it looks like the server does not post a receive  
for unexpected messages, but relies on the BMI method to receive the  
message and put it in a queue, and then return it when  
BMI_method_test_unexpected() is called. Am I reading this wrong?


No, that's partly my own confusion.  We post unexpected jobs in the  
server, but this doesn't translate to a posted receive for unexpected  
messages in BMI.  We just set up a queue for completed unexpected BMI  
messages, and populate that once BMI_testunexpected returns something.
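
In outline, that server-side pattern looks something like this (a
sketch only; poll_bmi_unexpected() stands in for the real
BMI_testunexpected() call, whose actual arguments live in bmi.h, and the
queue type and push helper are hypothetical):

    #define MAX_UNEXP 16

    struct unexp_msg;    /* stand-in for the unexpected-info structure */
    struct unexp_queue;  /* hypothetical server-side queue */

    /* Stand-in: fills 'msgs' with up to 'max' completed unexpected
     * messages and returns how many it found (wraps BMI_testunexpected
     * in reality).  Nothing is ever posted for these messages. */
    int poll_bmi_unexpected(struct unexp_msg **msgs, int max);

    void unexp_queue_push(struct unexp_queue *q, struct unexp_msg *m);

    static void gather_unexpected(struct unexp_queue *q)
    {
        struct unexp_msg *msgs[MAX_UNEXP];
        int n = poll_bmi_unexpected(msgs, MAX_UNEXP);

        for (int i = 0; i < n; i++)
            unexp_queue_push(q, msgs[i]);
    }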


-sam



Scott




Re: [Pvfs2-developers] BMI questions

2006-12-01 Thread Scott Atchley

On Dec 1, 2006, at 12:53 PM, Sam Lang wrote:

It looks like the flow code on the server doesn't actually post  
the next recv of IO (IO2) until the first recv has completed  
(IO1), so it's possible that the client posts (and starts) the  
next send before the server posts the next receive, although it's  
probably unlikely.


If IO operations are always > 32 KB, I would agree. But if any are  
<= 32 KB, MX will buffer them on the send side and complete  
immediately. The client could then post another even if MX is in  
the middle of delivering the first one. I can override this  
behavior (use mx_issend()) or use credits for flow control.


Hm...these particular IOs are going to post BMI_send calls > 32KB.   
If the IO is less than that, we probably want to pack the IO in the  
first request.  We call that eager mode, and you would need to have  
the BMI_get_info(BMI_GET_UNEXP_SIZE) return 32K.


The reason I mention 32 KB is that it is a magic (albeit adjustable)  
number in MX that determines when MX switches from sending messages  
eagerly to using rendezvous. I do not necessarily want to tie the  
maximum unexpected message size to that value (bmi_ib uses 8 KB for  
example).


If IO calls are always larger than 32 KB, then MX will use the  
rendezvous protocol and I do not have to worry about the server being  
overwhelmed with sends arriving before the matching post is received  
(in rendezvous mode, the client sends the header info only and the  
data stays on the client until the server indicates it is ready for  
the payload).


In either case it sounds like it's possible for a bunch of client  
sends to get posted, and a bunch of server receives to get posted,  
without any of them actually completing.  Is it possible to sort  
all that out if the same tag is specified for all of them?


In MX, matching is done in order so if they use the same tag, then  
send[0] should match against recv[0], send[1] matches recv[1], etc.  
If it doesn't, we will fix it. ;-)



Each BMI receive uses a separate buffer (up to a max of 8 buffers).


Does this mean that at most, the client will post 8 IO sends per  
operation?


The 8 buffer limit is specified by the FlowBuffersPerFlow config  
option, and it just limits the number of buffers that can be  
allocated on the server (and hence the number of outstanding BMI  
operations for a particular IO).  In the diagram I sent in the  
previous email, each IOn would have had an associated buffer.  When  
it gets to 8, no more BMI_post_recv calls are made until one of the  
TROVE_post_write calls has completed (freeing up one of the  
buffers).  None of that changes the behavior on the client, since  
the client uses the user buffer.  It keeps posting another send  
once a previous send has completed.


Ok.

Every time a bmi recv completes, two things happen: the  
associated trove write is posted, and a new bmi recv is posted.   
So over time, bmi receives will get posted at the server before  
bmi sends get posted at the client, but the second and maybe  
third bmi receives may be posted after the bmi sends at  
the client.


To answer your specific questions:

The same bmi tag is passed to each of the post_send and post_recv  
calls for the entire IO operation.


I can live with this as long as only one receive is posted at a  
time using a specific tag.


Hm..we actually do post multiple receives using the same tag.  All  
BMI messages for a given IO operation get the same tag.


As mentioned above, this should work. My statement that only one  
receive should be posted at a time was not well thought out. ;-)


No, that's partly my own confusion.  We post unexpected jobs in the  
server, but this doesn't translate to a posted receive for  
unexpected messages in BMI.  We just set up a queue for completed  
unexpected BMI messages, and populate that once BMI_testunexpected  
returns something.


-sam


Ok.

Thanks,

Scott


Re: [Pvfs2-developers] About state machines

2006-12-01 Thread Kwangho CHA
Thank you very much for your fast reply.

It will be really helpful to me.

Thanks again, Julian.

K. Cha

- Original Message -
From: "Julian Martin Kunkel" <[EMAIL PROTECTED]>,
To: ,
Date: 2006-12-02 00:47:45
Subject: Re: [Pvfs2-developers] About state machines

Hi,
you can find an excerpt of the documentation I have written (and am still 
writing) for PVFS2 here:
http://www.rzuser.uni-heidelberg.de/~jkunkel2/pvfs2-doc.pdf

Although it is not finished yet and contains some typos, it documents parts of 
the interaction between a system call and the server state machine invocation...

Julian