mbufs, external storage, and MFREE
I have the following question: Let's say that I have a block of user
memory which I've mapped into the kernel, and would like to send on a
network socket. I'd like to simply grab an mbuf, point to the memory as
external storage, and queue it up for transmission.

This would work fine, except that when MFREE gets called, I have to write
a deallocator that maintains a table of all the different cases where I've
done this, does a reverse mapping back to the original block, and then
deals with sending more, unmapping, etc. In other words, having MFREE call
a deallocator with just the data pointer and the size is inconvenient
(actually, it would make my scenario quite inefficient given the number of
mappings back to the original block that would have to be done).

Am I missing another mechanism to handle this? Does it not come up enough
to matter?

-Chris

To Unsubscribe: send mail to [EMAIL PROTECTED] with
"unsubscribe freebsd-hackers" in the body of the message
Re: mbufs, external storage, and MFREE
:I have the following question: Let's say that I have a block of user
:memory which I've mapped into the kernel, and would like to send on a
:network socket. I'd like to simply grab an mbuf, point to the memory as
:external storage, and queue it up for transmission. This would work fine,
:except that when MFREE gets called, I have to write a deallocator that
:maintains a table of all the different cases where I've done this, and do
:a reverse mapping back to the original block, and then deal with sending
:more, unmapping, etc. In other words, having MFREE call a deallocator
:with just the data pointer and the size is inconvenient (actually, it
:would make my scenario quite inefficient given the number of mappings back
:to the original block that would have to be done).
:
:Am I missing another mechanism to handle this? Does it not come up enough
:to matter?
:
:-Chris

    This is almost precisely the mechanism that the sendfile() system call
    uses. In that case it maps VMIO-backed data rather than user memory,
    but it is a very similar problem.

    There has been talk of implementing this type of mechanism not only
    for sockets, but for file read()/write() as well. In fact, John Dyson
    had delved into the issue with his vfs.ioopt stuff before he ran out
    of time.

    The one problem with using direct VM page mappings is that currently
    there is no way for the socket to prevent the underlying data from
    being modified in the middle of a transmission. And, in the same
    respect for vfs.ioopt, no way to prevent the data the user ostensibly
    read() into his 'private' buffer from changing out from under the user
    if the underlying file is modified.

    For user memory, the only way such a mechanism can currently be
    implemented is by obtaining the underlying pages and busying them for
    the duration of their use by the system, causing anyone trying to
    access them while the system operation is in progress to block.
    This can cause a potential problem with TCP in that the mbuf data you
    send to TCP sticks around until it gets pushed out the door *and*
    acknowledged by the other end. I.e. the data is not disposed of when
    read() or write() returns, as it would be with a copy, but instead
    goes directly into TCP's outgoing queue. If the TCP connection hangs,
    the process may hang.

-Matt
Matthew Dillon
[EMAIL PROTECTED]
Re: mbufs, external storage, and MFREE
On Thu, 23 Sep 1999, Matthew Dillon wrote:

:I have the following question: Let's say that I have a block of user
:memory which I've mapped into the kernel, and would like to send on a
:network socket. I'd like to simply grab an mbuf, point to the memory as
:external storage, and queue it up for transmission. This would work fine,
:except that when MFREE gets called, I have to write a deallocator that
:maintains a table of all the different cases where I've done this, and do
:a reverse mapping back to the original block, and then deal with sending
:more, unmapping, etc. In other words, having MFREE call a deallocator
:with just the data pointer and the size is inconvenient (actually, it
:would make my scenario quite inefficient given the number of mappings back
:to the original block that would have to be done).
:
:Am I missing another mechanism to handle this? Does it not come up enough
:to matter?
:
:-Chris

    This is almost precisely the mechanism that the sendfile() system call
    uses. In that case it maps VMIO-backed data rather than user memory,
    but it is a very similar problem.

    There has been talk of implementing this type of mechanism not only
    for sockets, but for file read()/write() as well. In fact, John Dyson
    had delved into the issue with his vfs.ioopt stuff before he ran out
    of time.

This is good--it seems a shame to copy things around all the time, though
I'm not sure where the crossover is between copying and mapping into
kernel space. (And, as a side note, what's up with struct buf? The thing
is bloody huge if you only want to map user memory into kernel space :)

    The one problem with using direct VM page mappings is that currently
    there is no way for the socket to prevent the underlying data from
    being modified in the middle of a transmission. And, in the same
    respect for vfs.ioopt, no way to prevent the data the user ostensibly
    read() into his 'private' buffer from changing out from under the user
    if the underlying file is modified.
Isn't this a case that the programmer has to handle? That is, if you mess
with the data before it actually gets written, that's your problem. I take
it the vfs.ioopt stuff is something like a temporary mmap() effect, since
in the socket case, once the data had been put in the buffer, I'd remove
the kernel mapping and thus not be able to tweak it.

    For user memory, the only way such a mechanism can currently be
    implemented is by obtaining the underlying pages and busying them for
    the duration of their use by the system, causing anyone trying to
    access them while the system operation is in progress to block.

    This can cause a potential problem with TCP in that the mbuf data you
    send to TCP sticks around until it gets pushed out the door *and*
    acknowledged by the other end. I.e. the data is not disposed of when
    read() or write() returns, but instead goes directly into TCP's
    outgoing queue. If the TCP connection hangs, the process may hang.

I had been thinking about this in the context of async I/O operations,
where it's OK to have the operation not complete until the data has
actually been ack'd by the remote end. With synchronous write() calls,
this can be more problematic, since it would significantly increase
latency in cases where the original coder might not expect it. It might
actually be nice to (optionally) have the same effect with async writes to
disk, where the operation wouldn't actually complete until the data was
known to be on the platter.

-Chris
Re: mbufs, external storage, and MFREE
: vfs.ioopt, no way to prevent the data the user ostensibly read() into
: his 'private' buffer from changing out from under the user if the
: underlying file is modified.
:
:Isn't this a case that the programmer has to handle? That is, if you mess
:with the data before it actually gets written, that's your problem. I
:take it that vfs.ioopt stuff is something like a temporary mmap() effect,
:since in the socket case once the data had been put in the buffer, I'd
:remove the kernel mapping and thus not be able to tweak it.

    Yes and no. Sometimes changing data out from under the kernel can
    cause bad things to happen. For example, changing TCP data out from
    under the TCP protocol will screw up checksums, and making changes to
    a buffer undergoing DMA might screw up a device protocol CRC or do
    something worse. The kernel thus needs to ensure that nothing the user
    does can screw it (the kernel) up.

: For user memory, the only way such a mechanism can currently be
: implemented is by obtaining the underlying pages and busy'ing them
: for the duration of their use by the system, causing anyone trying to
:...
:
:I had been thinking about this in the context of async I/O operations,
:where it's OK to have the operation not complete until the data has
:actually been ack'd by the remote end. With synchronous write() calls,
:this can be more problematic, since it would significantly increase
:latency in cases where the original coder might not expect it. It might
:actually be nice to (optionally) have the same effect with async writes
:to disk, where the operation wouldn't actually complete until the data
:was known to be on the platter.
:
:-Chris

    I think asynchronous I/O is the way to go with this too. An
    asynchronous API formally disallows changing data backing an I/O while
    the I/O is in progress (though it may not necessarily physically
    prevent the process from doing so). There is much less chance of the
    programmer making a mistake.
    Almost all of my embedded projects use asynchronous event-oriented I/O
    and simply eat the data out of the user process's memory space
    directly. Buffer copying is *really* expensive on a 68K CPU.

-Matt
Matthew Dillon
[EMAIL PROTECTED]
Re: mbufs, external storage, and MFREE
Matthew Dillon wrote...

:I have the following question: Let's say that I have a block of user
:memory which I've mapped into the kernel, and would like to send on a
:network socket. I'd like to simply grab an mbuf, point to the memory as
:external storage, and queue it up for transmission. This would work fine,
:except that when MFREE gets called, I have to write a deallocator that
:maintains a table of all the different cases where I've done this, and do
:a reverse mapping back to the original block, and then deal with sending
:more, unmapping, etc. In other words, having MFREE call a deallocator
:with just the data pointer and the size is inconvenient (actually, it
:would make my scenario quite inefficient given the number of mappings back
:to the original block that would have to be done).
:
:Am I missing another mechanism to handle this? Does it not come up enough
:to matter?
:
:-Chris

    This is almost precisely the mechanism that the sendfile() system call
    uses. In that case it maps VMIO-backed data rather than user memory,
    but it is a very similar problem.

    There has been talk of implementing this type of mechanism not only
    for sockets, but for file read()/write() as well. In fact, John Dyson
    had delved into the issue with his vfs.ioopt stuff before he ran out
    of time.

    The one problem with using direct VM page mappings is that currently
    there is no way for the socket to prevent the underlying data from
    being modified in the middle of a transmission. And, in the same
    respect for vfs.ioopt, no way to prevent the data the user ostensibly
    read() into his 'private' buffer from changing out from under the user
    if the underlying file is modified.

How about marking the page copy-on-write? That way, if the user modifies
the page while it is being transmitted, it'll just be copied, so the
original data will be intact.

Ken
--
Kenneth Merry
[EMAIL PROTECTED]
Re: mbufs, external storage, and MFREE
:
:How about marking the page copy-on-write? That way, if the user modifies
:the page while it is being transmitted, it'll just be copied, so the
:original data will be intact.
:
:Ken

    If it were a normal page we could, but the VM system currently cannot
    handle pages associated with vnodes themselves being marked
    copy-on-write. This is kinda hard to explain, but I will try.

    When a process maps a file MAP_PRIVATE, the VM object held by the
    process is not actually a vnode. Instead it is holding what is called
    a default object. The default object shadows the VM object
    representing the vnode. When a fault occurs, vm_fault knows to
    copy-on-write the page from the read-only backing VM object to the
    front VM object, and so from the point of view of the process, the
    page is copy-on-write. From the system's point of view, a new page has
    been added to the default VM object and no changes have been made to
    the vnode's VM object.

    When a process maps a file MAP_PRIVATE or MAP_SHARED and doesn't touch
    any of the pages, and some other process goes in and write()'s to the
    file via a descriptor, the process's view of the file will change
    because the pages associated with the underlying vnode have changed.

    The problem that occurs when we try to optimize read() by mapping a
    vnode's page into a user address space is that some other process may
    go and modify the underlying file, modifying the data that the user
    process sees *after* the read() has returned. But the user process is
    expecting that data not to change, because it thinks it has read() it
    into a private buffer when, in fact, the OS optimized the read by
    replacing the private memory with the file map. I.e. our problem is
    not so much the user process making a change to its buffer -- that
    case is handled by copy-on-write -- but of another process writing
    directly to the vnode, causing the data the first process read() to
    appear to change in its buffer.
-Matt
Matthew Dillon
[EMAIL PROTECTED]