Hi, On Sat, Oct 30, 2010 at 06:55:42PM -0700, Chuck Silvers wrote: > On Wed, Oct 27, 2010 at 06:38:11PM +0900, Masao Uebayashi wrote: > > On Tue, Oct 26, 2010 at 02:06:38PM -0700, Chuck Silvers wrote: > > > On Mon, Oct 25, 2010 at 02:09:43AM +0900, Masao Uebayashi wrote: > > > > I think the uebayasi-xip branch is ready to be merged. > > > > > > hi, > > > > > > here's what I found looking at the current code in the branch: > > > > > > > > > - the biggest issue I had with the version that I reviewed earlier > > > was that it muddled the notion of a "managed" page. you wanted > > > to create a new kind of partially-managed page for XIP devices > > > which would not be managed by the UVM in the sense that it could > > > contain different things depending on what was needed but only that > > > the page's mappings would be tracked so that pmap_page_protect() > > > could be called on it. this is what led to all the pmap changes > > > the pmaps needed to be able to handle being called with a vm_page > > > pointer that didn't actually point to a struct vm_page. > > > > > > it looks like you've gotten rid of that, which I like, but you've > > > replaced it with allocating a full vm_page structure for every page in > > > an XIP device, which seems like a waste of memory. as we discussed > > > earlier, I think it would be better to treat XIP pages as unmanaged > > > and change a few other parts of UVM to avoid needing to track the > > > mappings of XIP page mappings. I have thoughts on how to do all that, > > > which I'll list at the end of this mail. however, if XIP devices > > > are small enough that the memory overhead of treating device pages > > > as managed is reasonable, then I'll go along with it. > > > so how big do these XIP devices get? > > > > It's waste of memory, yes. With 64M ROM on arm (4K page, 80byte vm_page), > > the array is 1.25M. If vm_page's made a single pointer (sizeof(void > > *) == 4), the array size becomes 64K. Not small difference. > > Typical XIP'ed products would be mobile devices with FlashROM, or small > > servers with memory disk (md or xmd). About 16M~1G RAM/ROM? > > > > I made it back to have vm_page to simplify code. We can make it > > to vm_page_md or whatever minimal, once after we figure out the > > new design of "MI vm_page_md". > > > > > either way, the changes to the various pmaps to handle the fake vm_page > > > pointers aren't necessary anymore, so all those changes should be > > > reverted. > > > > Those mechanical vm_page -> vm_page_md changes done in pmaps have > > a valid point by itself. Passing around vm_page pointer across > > pmap functions is unnecessary. I'd rather say wrong. All pmap > > needs is vm_page_md. > > > > I'd propose to do this vm_page -> vm_page_md change in pmaps first > > in HEAD and sync the branch with it, rather than revert it. > > that seems a bit premature, don't you think? > since you're already talking about redesigning it? > > ... apparently not, you already checked it in.
It's not premature. It clarifies that passing struct vm_page * to pmap is unnecessary at all. We'll need to move those MD PV data structures to MI anyway. This is what I'm having in my head: struct vm_physseg { : paddr_t start, end; : struct vm_page *pgs; : struct vm_pv *pvs; : }; struct vm_pv { /* == struct vm_page_md, or what's called "PV head" now */ }; struct vm_pv_entry { /* == what's called "PV entry" now */ }; All pmaps are converted to use struct vm_pv array, instead of struct vm_page::struct vm_page_md. Here, the page identity is (struct vm_physseg *, off_t). You can retrieve the physical address of a given struct vm_page * or struct vm_pv *, by looking up the matching struct vm_physseg from the global list (as talked about killing vm_page::phys_addr). Of course XIP-capable, read-only physical segments have only vm_pv[]. I think tracking PV there for shared amap is worth doing in the "generic" XIP design. pmaps are passed struct vm_pv *, not struct vm_page *. Physical address is looked up by calling a function. > > > > > it doesn't look like the PMAP_UNMANAGED flag that you added before > > > is necessary anymore either, is that right? if so, it should also > > > be reverted. if not, what's it for now? > > > > pmap_enter() passes paddr_t directly to pmap. pmap has no clue if > > the given paddr_t is to be cached/uncached. So a separate flag, > > PMAP_UNMANAGED. Without it, a FlashROM which is registered as > > (potentially) managed (using bus_space_physload_device) is always > > mapped to user address as cached, even if it's not XIP. This made > > userland flash writer program not work. > > > > The point is whether to be cached or not is decided by virtual > > address. Thus such an information should be stored in vm_map_entry > > explicitly, in theory. > > > > (This is related to framebuffers too.) > > there is already an MI mechanism for specifying cachability of mappings, > see PMAP_NOCACHE and related flags. This is fixed, thanks. > > > > > - I mentioned before that I think the driver for testing this using > > > normal memory to simulate an XIP device should just be the existing > > > "md" driver, rather than a whole new driver whose only purpose > > > would be testing the XIP code. however, you wrote a separate > > > "xmd" driver anyway. so I'll say again: I think the "xmd" should be > > > merged into the "md" driver. > > > > It has turned out that the new xmd(4) is not really md(4); it's not > > md(4) modified to support XIP > > but rather > > rom(4) emulated using RAM > > > > I plan to merge xmd(4) into rom(4) or flash(4) when it's ready. > > If you don't like xmd(4) to reserve dev namespace, I'd rather back > > out xmd(4) totally. > > > > If you don't like the name, suggest a better one. ;) > > I'll respond to this in the other thread you started about it. > but creating a temporary device driver does seem pretty silly. > > > > > you also have an "xmd_machdep.c" for various platforms, but nothing > > > in that file is really machine-specific. rather than use the > > > machine-specific macros for converting between bytes and pages, > > > it would be better to either use the MI macros (ptoa / atop) > > > or just shift by PAGE_SHIFT, so that there doesn't need to be > > > extra code written for each platform. > > > > Well, xmd_machdep.c is not really xmd(4)-specific. It *is* > > machine-specific. It does vtophys(), not good put in MI part. > > > > ... and now I realize xmd_machdep_* should be renamed as pmap_*. > > I'll fix this. > > and this in the other thread too. I've made some changes this week. I've realized that xmd(4) is as strange as mem(4). :) I believe its design is clarified now (I'll fix mdsetimage(8) soon). The only remaining problem is its name. > > > > > - as we discussed before, the addition of bus_space_physload() and > > > related > > > interfaces seems very strange to me, especially since the two > > > implementations > > > you've added both do exactly the same thing (call bus_space_mmap() on > > > the > > > start and end of the address range, and then call uvm_page_physload() > > > with the resulting paddr_t values). is there any reason this can't be > > > a MI function? also, the non-device versions of these are unused. > > > > I'll move these code to sys/common. > > the location in the source tree isn't the weird part. > but a common implementation is an improvement. > > > > bus_space_physload(9) will be used when a driver wants to register > > its memory as a general purpose memory (linked into free list). > > Think PCI RAM over bridges. I've not written a driver to do this, > > but it's trivial. > > > > We also have to fix uvm_page_physload() so that it can be called > > at run-time, not only boot-time. But this is beyond this branch. > > > > > - in many files the only change you made was to replace the include of > > > <uvm/uvm_extern.h> with <uvm/uvm.h>, what's the reason for that? > > > > To reduce rebuild time when I worked on uvm_page.[ch]. Basically > > uvm_page.h is internal to UVM. Touching it outside of UVM is simply > > wrong. > > > > I'll put it back before the merge, to avoid potential build failures. > > thanks. Do you agree that exposing uvm_page.h to external world is wrong? Touch uvm_page.h, rebuild a kernel, then think. Why do those rebuilt files care about uvm_page.h definitions? This is a design problem, putting it back is just a work-around. > > > > > - we talked before about removing the "xip" mount option and enabling > > > this functionality automatically when it's available, which you did > > > but then recently changed back. so I'll ask again, > > > why do we need a mount option? > > > > I put it back because otherwise I can't know if a mount is XIP or > > not explicitly. Mount option is the only way to know mount condition. > > This is more a VFS/mount design problem. > > is your desire to control whether the mount accesses the device via mappings, > or just to be able to see whether or not the device is being accessed via > mappings? Both. My understanding is that mount options change only its internal behavior. I don't see how MNT_LOG and MNT_XIP differ. Mount is done only by admins who know internals. There may be cases where an XIP device is much slower than RAM, and page cache is wanted. I also think that mount option is the only way to configure per-mount behavior. > > > > > - as we've discussed before, I think that the XIP genfs_getpages() > > > should be merged with the existing one before you merge this into > > > -current. merging it as-is would make a mess, and it's better to > > > avoid creating messes than than to promise to clean them up later. > > > we've probably already spent more time arguing about it than > > > it would have taken you to do it. > > > > > > - similarly, I don't think that we should have a separate XIP putpages. > > > why doesn't the existing genfs_do_putpages() work for XIP pages as-is? > > > > This part I'm not convinced. Your argument can be hardly an excuse > > for you to not work on merging genfs_getpages() and genfs_do_io(). ;) > > > > If you *really* want me to work on this before the merge (which I > > do hope not), I'd create a separate branch to refactor genfs totally > > into smaller pieces as I did for uvm_fault.c. If you trust my > > refactoring-fu. > > well, it's been about a decade since I wrote this code so my memory is > a bit hazy as to how it ended up that way, but from looking at the cvs > history a bit just now, I think that the getpages and putpages paths > started off completely separate, and when I added the directio code, > I split genfs_do_io() out of the putpages code and called it from both. > yes, it would have been good to merge the getpages version at that point > too, but there wasn't a point where I took a unified code path and made > a copy of it. > > and I'm not giving any argument for not merging these functions, > I'm saying that their existance is not a good excuse to make another copy. IMO genfs abstraction is wrong in the first place. ;) All of its code should belong to either sys/uvm/uvm_vnode.c or sys/kern/vfs_*.c. > > to try to answer my own question about why you have a separate putpages, > I guess that's to handle the page of zeroes that you have attached to > each vnode. without the per-vnode page of zeroes, the existing > putpages would work fine. I'll rethink this zero page thing. > > > > > - in getpages, rather than allocating a page of zeros for each file, > > > it would be better to use the same page of zeroes that mem.c uses. > > > if multiple subsystems need a page of zeroes, then I think that > > > UVM should provide this functionality rather than duplicating > > > that all over the place. if you don't want to update all the copies > > > of mem.c, that's fine, but please put this in UVM instead of genfs. > > > > I made it per-vnode, because If you share a single physical zero'ed > > page, when a vnode is putpage'ed, it invalidates all mappings in > > other vnodes pointing to the zero page in a pmap. Doesn't this > > sound strange? > > I think you're misunderstanding something. for operations like > msync(), putpage is called on the object in the UVM map entry, > which would never be the object owning the global page of zeroes. > from the pagedaemon, putpage is called on a page's object, > but if the page replacement policy decides to reclaim the page > of zeroes then invalidating all the mappings is necessary. > if what you describe would actually happen, then yes, > I would find it strange. Give me some time to rethink this zero page thing... > > > > Another reason is that by having a vm_page with a uvm_object as > > its parent, fault handler doesn't need to know anything about the > > special zero'ed page. > > the fault handler wouldn't need to know about the page of zeroes, > why do you think it would? it does need to be aware that the pages > returned by getpages may not be owned by the object that getpages > is called with, but it needs to handle that regardless, for layered > file systems. Ditto (but my gut feeling is, layered file systems are totally broken by design.) > > > > > - I don't remember if I asked about this before: is there some reason > > > for there to be XIP code in the fs-specific code? couldn't this be > > > done in the VFS and vnode code instead so that it works for all > > > file system types instead of just FFS? > > > > This is due to the current VFS design. I agree that fs-neutral part > > should be moved to common part. It's just beyond this branch... > > could you explain what about the VFS design makes this hard? > > I guess I mentioned this mainly because in your earlier mail > it sounded like this feature would be available for all fs types. All mount options are processed in VFS_MOUNT()... And yes, this XIP can be used for all FSes. > > > > > - as you guessed in your mail, I don't like the new PQ_FIXED flag either. > > > it looks like it's only used for preventing the XIP pages from > > > being put on the paging queues, but that would be better done by > > > just initializing pg->wire_count to 1 so that UVM treats them > > > as permanently non-pageable (which they actually are). > > > > Are you OK to prevent XIP pages from being loaned? ;) > > > > The actual problems here are > > > > a) the generic fault handler doesn't know how to handle pages that > > never be paged out (device pages), and > > > > b) uvm loan uses vm_page. > > > > You wanted device pages to be able to be loaned. I agree the idea, > > but I'm feeling like that the way the loan "state" is kept > > (vm_page::loan_count) is wrong. I've not figured out how to address > > this yet. > > I've thought about loaning and XIP again quite a bit, but in the interest > of actually sending this mail today, I'll leave this part for later too. I spent a little time for this... Now uvm_loan() returns to a loaner an array of pointers to struct vm_page. Which means that those struct vm_page's are shared among loaner and loanee. The owner of those pages are recorded in vm_page::uanon and vm_page::uobject. I'm thinking this design is broken. At least A->A loan is impossible for no reason. I think loanee should allocate (by kmem_alloc(9)) a new *alias* vm_page, to keep loaning state. Alias vm_page has a pointer to its new owner, and with a pointer to the *real* vm_page too. uanon/uobject is also bad in that it obscures the 1:1 relationship of pages and owners. These two members should be merged into a single void *pg_owner. vm_page::loan_count is only for loaner. Loaners of read-only, wired pages don't need to remember anything. Masao > > -Chuck -- Masao Uebayashi / Tombi Inc. / Tel: +81-90-9141-4635