Re: Question about adding flags to mmap system call / NVIDIA amd64 driver implementation
On Tuesday 28 April 2009 7:58:57 pm Julian Elischer wrote:
> Robert Noland wrote:
> > On Tue, 2009-04-28 at 16:48 -0500, Kevin Day wrote:
> > > On Apr 28, 2009, at 3:19 PM, Julian Bangert wrote:
> > > > Hello, I am currently trying to work a bit on the remaining missing feature that NVIDIA requires ( http://wiki.freebsd.org/NvidiaFeatureRequests or a back post in this ML) - the improved mmap system call.
>
> you might check with jhb (john Baldwin) as I think (from his p4 work) that he may be doing something in this area in p4.

After some prompting from Robert and his needs for Xorg recently I did start hacking on this again. However, I haven't tested it yet. What I have done so far is in //depot/user/jhb/pat/... and it does the following:

1) Adds a vm_cache_mode_t. Each arch defines the valid values for this (I've only done the MD portions of this work for amd64 so far). Every arch must at least define a value for VM_CACHE_DEFAULT.

2) Stores a cache mode in each vm_map_entry struct. This cache mode is then passed down to a few pmap functions: pmap_object_init_pt(), pmap_enter_object(), and pmap_enter_quick(). Several vm_map routines such as vm_map_insert() and vm_map_find() now take a cache mode to use when adding a new mapping.

3) Each VM object stores a cache mode as well (defaults to VM_CACHE_DEFAULT). When a VM_CACHE_DEFAULT mapping is made of an object, the cache mode of the object is used.

4) A new VM object type: OBJT_SG. This object type has its own pager that is sort of like the device pager. However, instead of invoking d_mmap() to determine the physaddr for a given page, it consults a pre-created scatter/gather list (an ADT from my branch for working on unmapped buffer I/O) to determine the backing physical address for a given virtual address.

5) A new callback for device mmap: d_mmap_single(). One of the features of this is that it can return a vm_object_t to be used to satisfy the mmap() request instead of using the device's device pager VM object.

6) A new mcache() system call similar to mprotect(), except that it changes the cache mode of an address range rather than the protection. This may not be all that useful really.

Given all this, a driver could do the following to map a thing as WC in both userland and the kernel:

1) When it learns about a thing it creates an SG list to describe it. If the thing consists of userland pages, it has to wire the pages first. The driver can use vm_allocate_pager() to create an OBJT_SG VM object. It sets the object's cache mode to VM_CACHE_WC (if the arch supports that).

2) When userland wants to map the thing it does a device mmap() with a proper length and a file offset that is a cookie for the thing. The device driver's d_mmap_single() recognizes the magic file offset and returns the thing's VM object. Since the mapping info is now part of a normal object mapping, it will go away via munmap(), etc. The driver no longer has to do weird gymnastics to invalidate mappings from its device pager as transient mappings are no longer stored in the device pager.

3) When the driver wants to map the thing into the kernel, it can use vm_map_find() to insert the thing's VM object into the kernel map.

And I think that is all there is to it. I need to test this somehow to make sure though, and make sure this meets the needs of Robert and Nvidia.

--
John Baldwin
___
freebsd-hackers@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-hackers
To unsubscribe, send any mail to freebsd-hackers-unsubscr...@freebsd.org
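Items 3 and 4 of John's list can be illustrated with a small userland model. This is only a sketch under assumptions, not code from the p4 branch: the struct names (sg_seg, sg_list, vm_obj_model), the helper names, and the VM_CACHE_UC value are hypothetical stand-ins; only VM_CACHE_DEFAULT and VM_CACHE_WC are named in the message above. It shows a VM_CACHE_DEFAULT mapping inheriting the object's cache mode, and an offset being translated to a physical address by walking a scatter/gather list rather than calling d_mmap().

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the per-arch cache mode type (item 1). */
typedef enum {
	VM_CACHE_DEFAULT = 0,	/* every arch must define this one */
	VM_CACHE_WC,		/* write-combining, if the arch has it */
	VM_CACHE_UC		/* illustrative extra value, not from the mail */
} vm_cache_mode_t;

/* Hypothetical scatter/gather list: physical segments in object order. */
struct sg_seg {
	uint64_t paddr;		/* physical start of the segment */
	uint64_t len;		/* length in bytes */
};

struct sg_list {
	const struct sg_seg *segs;
	size_t nsegs;
};

/* Hypothetical model of a VM object backed by an SG list. */
struct vm_obj_model {
	vm_cache_mode_t cache_mode;
	struct sg_list sg;
};

/*
 * Item 3: a mapping made with VM_CACHE_DEFAULT uses the object's cache
 * mode; any other per-mapping mode is used as given.
 */
static vm_cache_mode_t
effective_cache_mode(vm_cache_mode_t entry_mode, const struct vm_obj_model *obj)
{
	assert(obj != NULL);
	return (entry_mode == VM_CACHE_DEFAULT ? obj->cache_mode : entry_mode);
}

/*
 * Item 4: translate an offset within the object to a physical address
 * by walking the segments.  Returns 0 on success, -1 if the offset lies
 * beyond the end of the list.
 */
static int
sg_lookup(const struct sg_list *sg, uint64_t offset, uint64_t *paddr)
{
	assert(sg != NULL && paddr != NULL);
	for (size_t i = 0; i < sg->nsegs; i++) {
		if (offset < sg->segs[i].len) {
			*paddr = sg->segs[i].paddr + offset;
			return (0);
		}
		offset -= sg->segs[i].len;
	}
	return (-1);
}
```

With the two segments {0x100000, 0x2000} and {0x800000, 0x1000}, offset 0x2800 falls 0x800 bytes into the second segment, so sg_lookup() yields 0x800800, and a VM_CACHE_DEFAULT mapping of a VM_CACHE_WC object resolves to VM_CACHE_WC.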
Re: Question about adding flags to mmap system call / NVIDIA amd64 driver implementation
On Thu, 2009-04-30 at 17:36 -0400, John Baldwin wrote:
> [...]
> And I think that is all there is to it. I need to test this somehow to make sure though, and make sure this meets the needs of Robert and Nvidia.

I think this sounds pretty good... I need to get my perforce foo up to speed so I can try it out...

robert.

--
Robert Noland rnol...@freebsd.org FreeBSD
Re: Question about adding flags to mmap system call / NVIDIA amd64 driver implementation
On Tue, Apr 28, 2009 at 22:19, Julian Bangert julid...@online.de wrote:
> Hello,
>
> I am currently trying to work a bit on the remaining missing feature that NVIDIA requires ( http://wiki.freebsd.org/NvidiaFeatureRequests or a back post in this ML) - the improved mmap system call.
>
> For now, I am trying to extend the current system call and implementation to add cache control (the type of memory caching used). This feature is inherently very architecture specific, but it can lead to enormous performance improvements for memory-mapped devices (useful for drivers, etc.).
>
> I would do this on the user side by adding 3 flags to the mmap system call (MEM_CACHE__ATTR1 to MEM_CACHE__ATTR3) which together form a single octal digit corresponding to the various caching options (like Uncacheable, Write Combining, etc.) with the same numbers as the PAT_* macros from i386/include/specialreg.h, except that the value 0 (PAT_UNCACHEABLE) is replaced with value 2 (undefined), whereas value 0 (all 3 flags cleared) is assigned the meaning "feature not used, use default cache control". For each cache behaviour there would of course also be a macro expanding to the right combination of these flags for enhanced usability.

Hmm, I don't like that. What about using something like PAT_WC directly for the userland? Afaik a userland app that uses stuff like this is md anyway.

> The mmap system call would, if any of these flags are set, decode them and get a corresponding PAT_* value, perform the mapping and then call into the pmap module to modify the cache attributes for every page.
>
> My first question is if there is a more elegant way of solving that - the 3 flags would be architecture specific (they could be used for other things on other architectures though if need be) and I do not know the policy on architecture specific syscall flags, therefore I appreciate any input.
>
> The second question goes to all those great VM/pmap gurus out there: As far as I understand, at the moment the pmap_change_attr can only change the cache flags for kernel pages. Is there a particular reason why this function might not be adapted/extended to userspace mappings? If not, I would either add a new function to iterate over all pages and set cache flags for a particular region or add a new member (possibly just add the 3 flags again?) to the md part of vm_page_t. Or one could just keep track and return errors as soon as someone tries to map a memory region (cache-customized mapping is usually done to device memory) already mapped with different cache behaviour.

Do you know how other OSes handle this stuff? Maybe there is some inspiration there for a clean interface.

I'm not sure if I remember correctly but there is something in my mind that we must take care that no virtual pages have different PAT settings for the same physical page. Maybe I read something like this in AMD's documentation of PAT. Sorry I don't remember exactly but perhaps someone else can explain it better.
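The flag encoding Julian Bangert describes can be sketched as follows. This is only an illustration of the proposal as worded in the mail, not an existing FreeBSD interface: the bit position (MEM_CACHE_SHIFT), the helper macros and the function name mmap_flags_to_pat() are assumptions made for the example. The PAT_* values are the memory-type encodings found in i386/include/specialreg.h and are repeated here so the sketch builds outside the kernel tree.

```c
#include <assert.h>
#include <stddef.h>

/* PAT memory-type encodings, as in i386/include/specialreg.h. */
#define PAT_UNCACHEABLE		0x00
#define PAT_WRITE_COMBINING	0x01
#define PAT_WRITE_THROUGH	0x04
#define PAT_WRITE_PROTECTED	0x05
#define PAT_WRITE_BACK		0x06
#define PAT_UNCACHED		0x07

/* Hypothetical placement of the three proposed flags in the mmap flags. */
#define MEM_CACHE_SHIFT		20
#define MEM_CACHE__ATTR1	(1u << (MEM_CACHE_SHIFT + 0))
#define MEM_CACHE__ATTR2	(1u << (MEM_CACHE_SHIFT + 1))
#define MEM_CACHE__ATTR3	(1u << (MEM_CACHE_SHIFT + 2))
#define MEM_CACHE_MASK \
	(MEM_CACHE__ATTR1 | MEM_CACHE__ATTR2 | MEM_CACHE__ATTR3)

/* Hypothetical convenience macros, one per cache behaviour. */
#define MEM_CACHE_FROM_DIGIT(d)		((unsigned)(d) << MEM_CACHE_SHIFT)
#define MEM_CACHE_UNCACHEABLE		MEM_CACHE_FROM_DIGIT(2)
#define MEM_CACHE_WRITE_COMBINING	MEM_CACHE_FROM_DIGIT(PAT_WRITE_COMBINING)
#define MEM_CACHE_WRITE_BACK		MEM_CACHE_FROM_DIGIT(PAT_WRITE_BACK)

/*
 * Decode the three flag bits into a PAT_* value.
 * Returns 0 if no cache flags are set (use the default cache control),
 * 1 if *pat was filled in, and -1 for an unassigned digit.
 */
static int
mmap_flags_to_pat(unsigned flags, int *pat)
{
	unsigned digit = (flags & MEM_CACHE_MASK) >> MEM_CACHE_SHIFT;

	assert(pat != NULL);
	switch (digit) {
	case 0:			/* all three flags clear: feature not used */
		return (0);
	case 2:			/* digit 2 stands in for PAT_UNCACHEABLE (0) */
		*pat = PAT_UNCACHEABLE;
		return (1);
	case PAT_WRITE_COMBINING:
	case PAT_WRITE_THROUGH:
	case PAT_WRITE_PROTECTED:
	case PAT_WRITE_BACK:
	case PAT_UNCACHED:
		*pat = (int)digit;	/* same number as the PAT_* macro */
		return (1);
	default:		/* digit 3 has no PAT meaning */
		return (-1);
	}
}
```

The swap of 0 and 2 is what lets a flags word with none of the bits set keep its current meaning, which is the compatibility property the proposal relies on. Passing PAT_* values through directly, as suggested in the reply, would avoid the remapping but would need another way to say "no preference".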
Re: Question about adding flags to mmap system call / NVIDIA amd64 driver implementation
On Apr 28, 2009, at 3:19 PM, Julian Bangert wrote:
> [...]
> The mmap system call would, if any of these flags are set, decode them and get a corresponding PAT_* value, perform the mapping and then call into the pmap module to modify the cache attributes for every page.

Have you looked at mem(4) yet?

    Several architectures allow attributes to be associated with ranges of physical memory. These attributes can be manipulated via ioctl() calls performed on /dev/mem. Declarations and data types are to be found in sys/memrange.h.

    The specific attributes, and number of programmable ranges may vary between architectures. The full set of supported attributes is:

    MDF_UNCACHEABLE    The region is not cached.
    MDF_WRITECOMBINE   Writes to the region may be combined or performed out of order.
    MDF_WRITETHROUGH   Writes to the region are committed synchronously.
    MDF_WRITEBACK      Writes to the region are committed asynchronously.
    MDF_WRITEPROTECT   The region cannot be written to.

This requires knowledge of the physical addresses, but I believe that's probably already necessary for what it sounds like you're trying to accomplish.

Back in the FreeBSD-3.0 days, I was writing a custom driver for an AGP graphics controller, and setting the MTRR flags for the exposed buffer was a definite improvement (200-1200% faster in most cases).

-- Kevin
Re: Question about adding flags to mmap system call / NVIDIA amd64 driver implementation
On Tue, 2009-04-28 at 16:48 -0500, Kevin Day wrote:
> [...]
> This requires knowledge of the physical addresses, but I believe that's probably already necessary for what it sounds like you're trying to accomplish.
>
> Back in the FreeBSD-3.0 days, I was writing a custom driver for an AGP graphics controller, and setting the MTRR flags for the exposed buffer was a definite improvement (200-1200% faster in most cases).

This is MTRR, which is what we currently do, when we can. The issue is that often times the BIOS maps ranges in a way that prevents us from using MTRR. This is generally ideal for things like agp and framebuffers when it works, since they have a specific physical range that you want to work with. With PCI(E) cards it isn't as cut and dry... In the ATI and Nouveau cases, we map scatter gather pages into the GART, which generally are allocated using contigmalloc behind the scenes, so it is also possible for it to work in that case. Moving forward, we may actually be mapping random pages into and out of the GART (GEM / TTM). In those cases we really don't have a large contiguous range that we could set MTRR on.

Intel CPUs are limited to 8 MTRR registers for the entire system also, so that can become an issue quickly if you are trying to manipulate several areas of memory. With PAT we can manipulate the caching properties on a page level. PAT also allows for some overlap conditions that MTRR won't, such as mapping a page write-combining on top of an UNCACHEABLE MTRR.

jhb@ has started some work on this, since I've been badgering him about this recently as well.

robert.
--
Robert Noland rnol...@freebsd.org FreeBSD
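Robert's point about scattered pages running out of MTRRs can be made concrete with a rough count. The sketch below is only an estimate under stated assumptions: it treats each variable-range MTRR as covering one naturally aligned power-of-two block of at least 4 KB, splits each range greedily from its low end (so the count is not guaranteed minimal), and takes the limit of 8 registers from the message above. The names mtrr_ranges_needed(), mtrr_total_needed() and struct phys_range are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MTRR_MIN_SIZE		4096ULL	/* assumed 4 KB granularity */
#define MTRR_VARIABLE_COUNT	8	/* limit quoted in the message above */

struct phys_range {
	uint64_t base;		/* physical start, 4 KB aligned */
	uint64_t len;		/* length in bytes, multiple of 4 KB */
};

/*
 * Count the naturally aligned power-of-two blocks needed to cover
 * [base, base + len), taking the largest block that fits at each step.
 */
static unsigned
mtrr_ranges_needed(uint64_t base, uint64_t len)
{
	unsigned n = 0;

	assert(base % MTRR_MIN_SIZE == 0 && len % MTRR_MIN_SIZE == 0);
	while (len > 0) {
		uint64_t block = MTRR_MIN_SIZE;

		/* Grow while the block still fits and base stays aligned. */
		while (block * 2 <= len && base % (block * 2) == 0)
			block *= 2;
		base += block;
		len -= block;
		n++;
	}
	return (n);
}

/* Sum the per-range counts for a set of unrelated physical ranges. */
static unsigned
mtrr_total_needed(const struct phys_range *r, size_t nranges)
{
	unsigned total = 0;

	assert(r != NULL || nranges == 0);
	for (size_t i = 0; i < nranges; i++)
		total += mtrr_ranges_needed(r[i].base, r[i].len);
	return (total);
}
```

A 256 MB aperture at 0xd0000000 needs a single block, which is the framebuffer case where MTRR works well. Sixteen isolated 4 KB pages need sixteen blocks, already twice MTRR_VARIABLE_COUNT, which is why per-page control through PAT is the better fit for GEM/TTM style mappings.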
Re: Question about adding flags to mmap system call / NVIDIA amd64 driver implementation
Robert Noland wrote:
> On Tue, 2009-04-28 at 16:48 -0500, Kevin Day wrote:
> > On Apr 28, 2009, at 3:19 PM, Julian Bangert wrote:
> > > Hello, I am currently trying to work a bit on the remaining missing feature that NVIDIA requires ( http://wiki.freebsd.org/NvidiaFeatureRequests or a back post in this ML) - the improved mmap system call.

you might check with jhb (john Baldwin) as I think (from his p4 work) that he may be doing something in this area in p4.