mbufs, external storage, and MFREE

1999-09-23 Thread Christopher Sedore


I have the following question:  Let's say that I have a block of user
memory which I've mapped into the kernel and would like to send it on a
network socket.  I'd like to simply grab an mbuf, point it at the memory
as external storage, and queue it up for transmission.  This would work
fine, except that when MFREE gets called, I have to write a deallocator
that maintains a table of all the different cases where I've done this,
does a reverse mapping back to the original block, and then deals with
sending more, unmapping, etc.  In other words, having MFREE call a
deallocator with just the data pointer and the size is inconvenient
(actually, it would make my scenario quite inefficient given the number
of mappings back to the original block that would have to be done).
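
For concreteness, here's roughly the shape of what I mean -- just a
sketch against the current m_ext fields; my_free() and wrap_user_block()
are made-up names, and kva is assumed to be a kernel mapping of the user
block:

    #include <sys/param.h>
    #include <sys/systm.h>
    #include <sys/mbuf.h>

    /*
     * MFREE will call this with only (buf, size) -- hence the
     * reverse-mapping table I'd be forced to keep.
     */
    static void
    my_free(caddr_t buf, u_int size)
    {
            /* look buf up in some table to find the original block */
    }

    /* kva is a kernel mapping of the user block, len its size */
    static struct mbuf *
    wrap_user_block(caddr_t kva, u_int len)
    {
            struct mbuf *m;

            MGET(m, M_WAIT, MT_DATA);
            m->m_ext.ext_buf  = kva;      /* the external storage   */
            m->m_ext.ext_free = my_free;  /* invoked from MFREE     */
            m->m_ext.ext_size = len;
            m->m_flags |= M_EXT;
            m->m_data = kva;
            m->m_len  = len;
            return (m);
    }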

Am I missing another mechanism to handle this?  Does it not come up enough
to matter? 

-Chris






Re: mbufs, external storage, and MFREE

1999-09-23 Thread Matthew Dillon

:I have the following question:  Let's say that I have a block of user
:memory which I've mapped into the kernel and would like to send it on a
:network socket.  I'd like to simply grab an mbuf, point it at the memory
:as external storage, and queue it up for transmission.  This would work
:fine, except that when MFREE gets called, I have to write a deallocator
:that maintains a table of all the different cases where I've done this,
:does a reverse mapping back to the original block, and then deals with
:sending more, unmapping, etc.  In other words, having MFREE call a
:deallocator with just the data pointer and the size is inconvenient
:(actually, it would make my scenario quite inefficient given the number
:of mappings back to the original block that would have to be done).
:
:Am I missing another mechanism to handle this?  Does it not come up enough
:to matter? 
:
:-Chris

This is almost precisely the mechanism that the sendfile() system call
uses.  In that case it maps VMIO-backed data rather than user memory,
but it is a very similar problem.
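
For reference, the userland side looks roughly like this (a sketch only;
ship_file() is a made-up wrapper and error handling is minimal):

    #include <sys/types.h>
    #include <sys/socket.h>
    #include <sys/uio.h>
    #include <stdio.h>

    /* Ship nbytes of the file fd, starting at offset, out socket s. */
    static int
    ship_file(int fd, int s, off_t offset, size_t nbytes)
    {
            off_t sbytes;

            /* The file's pages are mapped and wired, not copied. */
            if (sendfile(fd, s, offset, nbytes, NULL, &sbytes, 0) == -1) {
                    perror("sendfile");
                    return (-1);
            }
            return (0);
    }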

There has been talk of implementing this type of mechanism not only for
sockets, but for file read()/write() as well.  In fact, John Dyson had
delved into the issue with his vfs.ioopt stuff before he ran out of time.

The one problem with using direct VM page mappings is that currently there
is no way for the socket to prevent the underlying data from being 
modified in the middle of a transmission.  And, in the same respect for
vfs.ioopt, no way to prevent the data the user ostensibly read() into
his 'private' buffer from changing out from under the user if the
underlying file is modified.

For user memory, the only way such a mechanism can currently be 
implemented is by obtaining the underlying pages and busy'ing them
for the duration of their use by the system, causing anyone trying to
access them while the system operation is in progress to block.  This
can cause a potential problem with TCP in that the mbuf data you send
to TCP sticks around until it gets pushed out the door *and* acknowledged
by the other end.  i.e. the data is not disposed of when read() or
write() returns, but instead goes directly into TCP's outgoing queue.
If the TCP connection hangs, the process may hang.

-Matt
Matthew Dillon 
[EMAIL PROTECTED]






Re: mbufs, external storage, and MFREE

1999-09-23 Thread Christopher Sedore



On Thu, 23 Sep 1999, Matthew Dillon wrote:

 [...]
 
 This is almost precisely the mechanism that the sendfile() system call
 uses.  In that case it maps VMIO-backed data rather then user memory,
 but it is a very similar problem.
 
 There has been talk of implementing this type of mechanism not only for
 sockets, but for file read()/write() as well.  In fact, John Dyson had
 delved into the issue with his vfs.ioopt stuff before he ran out of time.

This is good--it seems a shame to copy things around all the time, though
I'm not sure where the crossover is between copying and mapping into
kernel space.  (And, as a side note, what's up with struct buf? The thing
is bloody huge if you only want to map user memory into kernel space :)

 The one problem with using direct VM page mappings is that currently there
 is no way for the socket to prevent the underlying data from being 
 modified in the middle of a transmission.  And, in the same respect for
 vfs.ioopt, no way to prevent the data the user ostensibly read() into
 his 'private' buffer from changing out from under the user if the
 underlying file is modified.

Isn't this a case that the programmer has to handle?  That is, if you mess
with the data before it actually gets written, that's your problem.  I
take it that the vfs.ioopt stuff is something like a temporary mmap() effect,
since in the socket case once the data had been put in the buffer, I'd
remove the kernel mapping and thus not be able to tweak it.
 
 For user memory, the only way such a mechanism can currently be 
 implemented is by obtaining the underlying pages and busy'ing them
 for the duration of their use by the system, causing anyone trying to
 access them while the system operation is in progress to block.  This
 can cause a potential problem with TCP in that the mbuf data you send
 to TCP sticks around until it gets pushed out the door *and* acknowledged
 by the other end.  i.e. the data is not disposed of when read() or
 write() returns, but instead goes directly into TCP's outgoing queue.
 If the TCP connection hangs, the process may hang.
 

I had been thinking about this in the context of async I/O operations,
where it's OK to have the operation not complete until the data has
actually been ack'd by the remote end.  With synchronous write() calls,
this can be more problematic since it would significantly increase latency
in cases where the original coder might not expect it.  It might actually
be nice to (optionally) have the same effect with async writes to disk,
where the operation wouldn't actually complete until the data was known to
be on the platter.

-Chris







Re: mbufs, external storage, and MFREE

1999-09-23 Thread Matthew Dillon

: vfs.ioopt, no way to prevent the data the user ostensibly read() into
: his 'private' buffer from changing out from under the user if the
: underlying file is modified.
:
:Isn't this a case that the programmer has to handle?  That is, if you mess
:with the data before it actually gets written, that's your problem.  I
:take it that the vfs.ioopt stuff is something like a temporary mmap() effect,
:since in the socket case once the data had been put in the buffer, I'd
:remove the kernel mapping and thus not be able to tweak it.

Yes and no.  Sometimes changing data out from under the kernel can
cause bad things to happen.  For example, changing TCP data out from
under the TCP protocol will screw up checksums, and making changes
to a buffer undergoing DMA might screw up a device protocol CRC or
do something worse.  The kernel thus needs to ensure that nothing the
user does can screw it (the kernel) up.
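
A trivial userland illustration of the checksum point -- an RFC 1071
style ones-complement sum, nothing kernel-specific, and the scenario
is made up:

    #include <stdint.h>
    #include <stddef.h>
    #include <stdio.h>

    /* 16-bit ones-complement Internet checksum over nwords words. */
    static uint16_t
    cksum(const uint16_t *buf, size_t nwords)
    {
            uint32_t sum = 0;

            while (nwords--)
                    sum += *buf++;
            while (sum >> 16)               /* fold the carries */
                    sum = (sum & 0xffff) + (sum >> 16);
            return ((uint16_t)~sum);
    }

    int
    main(void)
    {
            uint16_t data[4] = { 1, 2, 3, 4 };
            uint16_t stored = cksum(data, 4); /* sum put in the header */

            data[0] = 99;   /* "user" dirties the buffer mid-flight */

            /* A retransmit now carries new data under the old sum. */
            printf("stored %#x, actual %#x\n", stored, cksum(data, 4));
            return (0);
    }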

: For user memory, the only way such a mechanism can currently be 
: implemented is by obtaining the underlying pages and busy'ing them
: for the duration of their use by the system, causing anyone trying to
:...
: 
:
:I had been thinking about this in the context of async I/O operations,
:where it's OK to have the operation not complete until the data has
:actually been ack'd by the remote end.  With synchronous write() calls,
:this can be more problematic since it would significantly increase latency
:in cases where the original coder might not expect it.  It might actually
:be nice to (optionally) have the same effect with async writes to disk,
:where the operation wouldn't actually complete until the data was known to
:be on the platter.
:
:-Chris

I think asynchronous I/O is the way to go with this too.  An
asynchronous API formally disallows changing data backing an I/O
while the I/O is in progress (though it may not necessarily physically
prevent the process from doing so).  There is much less chance of the
programmer making a mistake.
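
With POSIX aio, for instance, the contract is explicit.  A sketch
(async_write_all() is a made-up helper and error handling is minimal):

    #include <aio.h>
    #include <errno.h>
    #include <string.h>
    #include <unistd.h>

    static char buf[8192];

    /* Write buf to fd asynchronously; returns bytes written or -1. */
    static ssize_t
    async_write_all(int fd)
    {
            struct aiocb cb;
            const struct aiocb *list[1];

            memset(&cb, 0, sizeof(cb));
            cb.aio_fildes = fd;
            cb.aio_buf    = buf;
            cb.aio_nbytes = sizeof(buf);
            cb.aio_offset = 0;

            if (aio_write(&cb) == -1)
                    return (-1);

            /*
             * From here until completion buf belongs to the kernel;
             * the API says we must not touch it, even though nothing
             * physically stops the process from doing so.
             */
            list[0] = &cb;
            while (aio_error(&cb) == EINPROGRESS)
                    aio_suspend(list, 1, NULL);   /* block until done */
            return (aio_return(&cb));
    }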

Almost all of my embedded projects use asynchronous event-oriented
I/O and simply eat the data out of the user process's memory space
directly.  Buffer copying is *really* expensive on a 68K cpu.

-Matt
Matthew Dillon 
[EMAIL PROTECTED]





Re: mbufs, external storage, and MFREE

1999-09-23 Thread Kenneth D. Merry

Matthew Dillon wrote...
 [...]

 The one problem with using direct VM page mappings is that currently there
 is no way for the socket to prevent the underlying data from being 
 modified in the middle of a transmission.  And, in the same respect for
 vfs.ioopt, no way to prevent the data the user ostensibly read() into
 his 'private' buffer from changing out from under the user if the
 underlying file is modified.

How about marking the page copy-on-write?  That way, if the user modifies
the page while it is being transmitted, it'll just be copied, so the
original data will be intact.

Ken
-- 
Kenneth Merry
[EMAIL PROTECTED]





Re: mbufs, external storage, and MFREE

1999-09-23 Thread Matthew Dillon

:
:How about marking the page copy-on-write?  That way, if the user modifies
:the page while it is being transmitted, it'll just be copied, so the
:original data will be intact.
:
:Ken

If it were a normal page we could, but the VM system currently cannot
handle marking the pages associated with a vnode itself copy-on-write.

This is kinda hard to explain, but I will try.

When a process maps a file MAP_PRIVATE, the VM object held by the process
is not actually the vnode's VM object.   Instead it holds what is called
a default object.  The default object shadows the VM object representing
the vnode.  When a fault occurs, vm_fault knows to copy-on-write the page
from the read-only backing VM object to the front VM object, and so
from the point of view of the process, the page is copy-on-write.  From
the system's point of view, a new page has been added to the default
VM object and no changes have been made to the vnode's VM object.

When a process maps a file MAP_PRIVATE or MAP_SHARED and doesn't touch
any of the pages, and some other process goes in and write()'s to the
file via a descriptor, the process's view of the file will change
because the pages associated with the underlying vnode have changed.
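
You can watch this happen from userland.  A sketch (error checks are
elided, and strictly speaking whether the first printf sees the new data
is unspecified, but it is what the behaviour described above produces):

    #include <sys/mman.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int
    main(void)
    {
            int fd = open("demo.dat", O_RDWR | O_CREAT | O_TRUNC, 0644);
            char *p;

            write(fd, "aaaa", 4);
            p = mmap(NULL, 4, PROT_READ | PROT_WRITE, MAP_PRIVATE, fd, 0);

            pwrite(fd, "bbbb", 4, 0);  /* write to the vnode via fd   */
            printf("%.4s\n", p);       /* untouched page: sees "bbbb" */

            p[0] = 'X';                /* fault: page is copied into
                                        * the shadowing default object */
            pwrite(fd, "cccc", 4, 0);
            printf("%.4s\n", p);       /* now private: still "Xbbb"   */
            return (0);
    }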

The problem that occurs when we try to optimize read by mapping
a vnode's page into a user address space is that some other process
may go and modify the underlying file, modifying the data that the
user process sees *after* the read() has returned.  But the user process
is expecting that data not to change because it thinks it has read() it
into a private buffer when, in fact, the OS optimized the read by replacing
the private memory with the file map.

i.e. our problem is not so much the user process making a change to its
buffer -- that case is handled by copy-on-write -- but another process
writing directly to the vnode, causing the data the first process read()
to appear to change in its buffer.

-Matt
Matthew Dillon 
[EMAIL PROTECTED]


