Hi

Just as an introduction: I maintain the XFree86 packages at SuSE and am
therefore also responsible for XFree86 4.x/DRI support on SuSE Linux.

I would like to let you know about some pte/highmem changes in the SuSE
kernel of SuSE 8.0 and in upcoming upstream kernel releases. Andrea
Arcangeli - you may know him as a kernel developer - wrote the attached
document. Additionally, I attach two patches we apply to the DRM
XFree86 modules we use for SuSE 8.0. The first one is the required
pte/highmem patch for the SuSE 8.0 kernel. The second one has been
required since kernel 2.4.18-pre7/pre8, and we needed it for an update
kernel.

I hope you will consider integrating these changes into upcoming DRM releases.

If you have any questions, feel free to contact me directly, as I do
not read this mailing list. Though perhaps I should, as the DRI
maintainer at SuSE. :-)

Stefan

Public Key available
----------------------------------------------------
Stefan Dirsch (Res. & Dev.)   SuSE Linux AG
Tel: 0911-740530              Deutschherrnstr. 15-19
FAX: +49 911 741 77 55        D-90429 Nuernberg
http://www.suse.de            Germany 
----------------------------------------------------
Title: Device driver updates required by pte-highmem

What is pte-highmem

pte-highmem is a feature that allows pagetables to be allocated in high memory (above 800M physical). This is a critical requirement for highmem servers mapping plenty of shared memory, such as databases, which would otherwise run out of lowmem with lots of tasks mapping the SHM.

The pte-highmem feature is included into the SuSE kernels starting from 2.4.18-SuSE in SuSE Linux 8.0.

Why some device drivers need changes to operate correctly with pte-highmem

In theory no device driver should be required to touch (read/write) pagetables by hand. Proper functions like remap_page_range()/ioremap()/vmalloc()/vfree() are meant to deal with pagetables in a manner transparent to device drivers.
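For the common case of exporting driver memory to user space, those transparent helpers are enough. As a minimal sketch (mydrv_mmap and mydrv_phys_base are hypothetical names; the 4-argument remap_page_range() prototype of the 2.4 kernels is assumed):

```c
/* Hypothetical 2.4-style mmap handler: the driver never touches
 * pagetables itself, remap_page_range() builds the mappings.
 * Assumed 2.4 prototype: remap_page_range(start, phys, size, prot). */
static int mydrv_mmap(struct file *file, struct vm_area_struct *vma)
{
        unsigned long size = vma->vm_end - vma->vm_start;

        /* mydrv_phys_base: assumed device-specific physical base address */
        if (remap_page_range(vma->vm_start, mydrv_phys_base,
                             size, vma->vm_page_prot))
                return -EAGAIN;
        return 0;
}
```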

In practice a few device drivers are currently walking kernel pagetables by hand to find the physical pages backing a vmalloc virtual memory area. Those drivers need the physical pages in order to set up DMA SG entries that allow devices to transfer data to and from those vmalloc areas.

The reason those drivers were walking pagetables by hand is that the functionality to resolve a vmalloc virtual page to a physical page wasn't provided by common code, so all the details of pagetable handling were exposed to the low-level drivers. This has been fixed in the stable >=2.4.19 kernels and in the 2.5.x branch.

Long term solution: vmalloc_to_page()

The new functionality available since 2.4.19 is called vmalloc_to_page():
    struct page * vmalloc_to_page(void * vmalloc_addr)
      
  • 'vmalloc_addr' is the vmalloc virtual address.
  • 'page' is the physical page that backs the vmalloc virtual address.
vmalloc_to_page() takes care of all the pagetable handling details internally and in turn hides those details completely from the low-level device drivers.
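As a usage sketch (mydrv_build_pagelist is a hypothetical name), this is how a driver could collect the pages backing a vmalloc() area, e.g. to set up DMA SG entries, without ever touching pagetables:

```c
/* Sketch: gather the struct page behind each page of a vmalloc()ed
 * buffer via vmalloc_to_page() (kernels >= 2.4.19); no pagetable
 * walking happens in the driver. */
static int mydrv_build_pagelist(void *buf, unsigned long pages,
                                struct page **pagelist)
{
        unsigned long i;

        for (i = 0; i < pages; i++) {
                pagelist[i] = vmalloc_to_page(buf + i * PAGE_SIZE);
                if (!pagelist[i])
                        return -EFAULT;
        }
        return 0;
}
```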

Example of conversion

This is an example of old code from before vmalloc_to_page() was made available (taken from drivers/media/video/meye.c):
    static inline unsigned long uvirt_to_kva(pgd_t *pgd, unsigned long adr) {
            unsigned long ret = 0UL;
            pmd_t *pmd;
            pte_t *ptep, pte;

            if (!pgd_none(*pgd)) {
                    pmd = pmd_offset(pgd, adr);
                    if (!pmd_none(*pmd)) {
                            ptep = pte_offset(pmd, adr);
                            pte = *ptep;
                            if (pte_present(pte)) {
                                    ret = (unsigned long)page_address(pte_page(pte));
                                    ret |= (adr & (PAGE_SIZE - 1));
                            }
                    }
            }
            MDEBUG(printk("uv2kva(%lx-->%lx)\n", adr, ret));
            return ret;
    }
    
    static inline unsigned long kvirt_to_pa(unsigned long adr) {
            unsigned long va, kva, ret;

            va = VMALLOC_VMADDR(adr);
            kva = uvirt_to_kva(pgd_offset_k(va), va);
            ret = __pa(kva);
            MDEBUG(printk("kv2pa(%lx-->%lx)\n", adr, ret));
            return ret;
    }
      
This is the same code but converted to use vmalloc_to_page():
    static inline unsigned long kvirt_to_pa(unsigned long adr) {
            unsigned long kva, ret;

            kva = (unsigned long) page_address(vmalloc_to_page((void *)adr));
            kva |= adr & (PAGE_SIZE - 1); /* restore the offset within the page */
            ret = __pa(kva);
            return ret;
    }
      
NOTE: the above page_address(vmalloc_to_page((void *)adr)) is valid only because the vmalloc space has been allocated with vmalloc_32(); otherwise kvirt_to_pa() itself would be broken (it can only return a physical address in the 0-4G range). With pci64 devices it would be better to allocate the vmalloc memory with a plain vmalloc() and to work with 'page structures' instead of physical addresses; that would save some lowmem RAM, and working with page structures is cleaner anyway.
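As a sketch of the page-structure approach suggested in the NOTE (mydrv_map_one is a hypothetical name; pci_map_page() is assumed available, as documented in the 2.4 DMA-mapping.txt):

```c
/* Sketch: map one page of a plain vmalloc() area for DMA by struct
 * page instead of by physical address; no 0-4G assumption is made,
 * so this also suits pci64-capable devices. */
static dma_addr_t mydrv_map_one(struct pci_dev *pdev, unsigned long adr)
{
        struct page *page = vmalloc_to_page((void *)adr);

        return pci_map_page(pdev, page, adr & (PAGE_SIZE - 1),
                            PAGE_SIZE, PCI_DMA_TODEVICE);
}
```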

Lack of vmalloc_to_page() in the <= 2.4.18 kernels

While the above is the right way to update those drivers for the >=2.4.19 and 2.5 kernels, so that they work fine with the new pte-highmem feature, the 2.4.18 kernel did not yet provide the vmalloc_to_page() functionality to modules.

In turn, the 2.4.18-SuSE kernel (based on 2.4.18) in SuSE Linux 8.0 doesn't export vmalloc_to_page() either.

So for the 2.4.18-SuSE kernel, the pagetable handling functions in the drivers must be updated to handle the pte-highmem feature. Either that, or vmalloc_to_page() should be copied from 2.4.19's mm/memory.c into the device driver's .c file.
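For reference, the function is small; its 2.4.19 shape is approximately the following (copy the authoritative version from mm/memory.c of the kernel you target, since the exact pagetable accessors differ between trees):

```c
/* Approximate shape of 2.4.19's vmalloc_to_page(); see mm/memory.c
 * in the actual tree for the authoritative version. */
struct page *vmalloc_to_page(void *vmalloc_addr)
{
        unsigned long addr = (unsigned long) vmalloc_addr;
        struct page *page = NULL;
        pgd_t *pgd = pgd_offset_k(addr);
        pmd_t *pmd;
        pte_t *ptep, pte;

        if (!pgd_none(*pgd)) {
                pmd = pmd_offset(pgd, addr);
                if (!pmd_none(*pmd)) {
                        ptep = pte_offset(pmd, addr);
                        pte = *ptep;
                        if (pte_present(pte))
                                page = pte_page(pte);
                }
        }
        return page;
}
```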

Example of device driver update for the 2.4.18-SuSE kernel

Old, non-pte-highmem-capable code (again from meye.c):
    static inline unsigned long uvirt_to_kva(pgd_t *pgd, unsigned long adr) {
            unsigned long ret = 0UL;
            pmd_t *pmd;
            pte_t *ptep, pte;

            if (!pgd_none(*pgd)) {
                    pmd = pmd_offset(pgd, adr);
                    if (!pmd_none(*pmd)) {
                            ptep = pte_offset(pmd, adr);
                            pte = *ptep;
                            if (pte_present(pte)) {
                                    ret = (unsigned long)page_address(pte_page(pte));
                                    ret |= (adr & (PAGE_SIZE - 1));
                            }
                    }
            }
            MDEBUG(printk("uv2kva(%lx-->%lx)\n", adr, ret));
            return ret;
    }
      
Updated, pte-highmem-capable code (not yet using vmalloc_to_page(), which is not available under 2.4.18):
    static inline unsigned long uvirt_to_kva(pgd_t *pgd, unsigned long adr) {
            unsigned long ret = 0UL;
            pmd_t *pmd;
            pte_t *ptep, pte;

            if (!pgd_none(*pgd)) {
                    pmd = pmd_offset(pgd, adr);
                    if (!pmd_none(*pmd)) {
                            ptep = pte_offset_atomic(pmd, adr);
                            pte = *ptep;
                            pte_kunmap(ptep);
                            if (pte_present(pte)) {
                                    ret = (unsigned long)page_address(pte_page(pte));
                                    ret |= (adr & (PAGE_SIZE - 1));
                            }
                    }
            }
            MDEBUG(printk("uv2kva(%lx-->%lx)\n", adr, ret));
            return ret;
    }
      
Unified diff:
    diff -urN pte-highmem-ref/drivers/media/video/meye.c pte-highmem/drivers/media/video/meye.c
    --- pte-highmem-ref/drivers/media/video/meye.c	Tue Mar 12 00:07:13 2002
    +++ pte-highmem/drivers/media/video/meye.c	Tue Mar 12 14:26:10 2002
    @@ -129,8 +129,9 @@
     	if (!pgd_none(*pgd)) {
                     pmd = pmd_offset(pgd, adr);
                     if (!pmd_none(*pmd)) {
    -                        ptep = pte_offset(pmd, adr);
    +                        ptep = pte_offset_atomic(pmd, adr);
                             pte = *ptep;
    +			pte_kunmap(ptep);
                             if(pte_present(pte)) {
     				ret = (unsigned long)page_address(pte_page(pte));
     				ret |= (adr & (PAGE_SIZE - 1));
      
Some drivers may also need a:
      #include <linux/highmem.h>
      
at the top of the file in order to compile correctly.

pte_offset_atomic() won't block (no spinning and no scheduler calls inside) and it can be called from any normal kernel context (not from irq/bh context, though: touching pagetables from irq/bh would be a bug in the first place). It can be called with spinlocks held because it doesn't block and doesn't acquire any other lock internally.

pte_offset_atomic() opens a critical section that must be closed with pte_kunmap(). This is the big difference introduced by the pte-highmem design: it requires closing the critical section with pte_kunmap().

The driver is not allowed to sleep within the critical section (it is an atomic kmap). For this reason it is recommended to read the contents of the pagetables right after pte_offset_atomic(), and to call pte_kunmap() right after the read. Here is another example covering this case:

    diff -urN NVIDIA_kernel-1.0-2313/nv.c NVIDIA_kernel-1.0-2313.pte-highmem/nv.c
    --- NVIDIA_kernel-1.0-2313/nv.c	Tue Nov 27 21:39:17 2001
    +++ NVIDIA_kernel-1.0-2313.pte-highmem/nv.c	Sun Feb  3 16:35:18 2002
    @@ -42,6 +42,7 @@
     #include <linux/interrupt.h>           
     #include <linux/tqueue.h>               // struct tq_struct 
     #include <linux/poll.h>
    +#include <linux/highmem.h>
     #ifdef CONFIG_PM
     #include <linux/pm.h>                   // power management
     #endif
    @@ -2267,7 +2268,7 @@
     {
         pgd_t *pg_dir;
         pmd_t *pg_mid_dir;
    -    pte_t *pg_table;
    +    pte_t *pg_table, pte;
     
         /* XXX do we really need this? */
         if (address > VMALLOC_START)
    @@ -2297,11 +2298,13 @@
         if (pmd_none(*pg_mid_dir))
             goto failed;
     
    -    pg_table = pte_offset(pg_mid_dir, address);
    -    if (!pte_present(*pg_table))
    +    pg_table = pte_offset_atomic(pg_mid_dir, address); /* map */
    +    pte = *pg_table; /* read */
    +    pte_kunmap(pg_table); /* unmap */
    +    if (!pte_present(pte))
             goto failed;
     
    -    return ((pte_val(*pg_table) & KERN_PAGE_MASK) | NV_MASK_OFFSET(address));
    +    return ((pte_val(pte) & KERN_PAGE_MASK) | NV_MASK_OFFSET(address));
     
       failed:
         return (unsigned long) NULL;
      

How to write sources that can be compiled cleanly with all kernels out there

With some #ifdef trickery it is possible to write code that compiles cleanly under all kernels out there:
  • <= 2.4.18 (use old pte code)
  • 2.4.18-SuSE (use new pte-highmem methods)
  • >= 2.4.19 and 2.5.x (use vmalloc_to_page())

Here is an example:

    #include <linux/version.h>
    #if LINUX_VERSION_CODE >= KERNEL_VERSION(2,4,19)
     /* Kernel >= 2.4.19: use vmalloc_to_page() */
    #else
     #include <linux/highmem.h>
     #ifdef pte_offset_atomic
      /* SuSE kernel 2.4.18: use the new pte-highmem methods */
     #else
      /* Kernel < 2.4.19: use the old pte code */
     #endif
    #endif
      
This relies on the fact that pte_offset_atomic() is a preprocessor macro, and that's not going to change.

NOTE: in the long term the #ifdef trickery should go away for code clarity, and only the vmalloc_to_page() approach should be retained.
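Putting the three cases together, a driver could define a single local helper once (my_vmalloc_to_page() is a hypothetical name) and call it unconditionally from the rest of its code; a sketch:

```c
#include <linux/version.h>
#include <linux/mm.h>
#if LINUX_VERSION_CODE < KERNEL_VERSION(2,4,19)
#include <linux/highmem.h>
#endif

#if LINUX_VERSION_CODE >= KERNEL_VERSION(2,4,19)
/* Kernel >= 2.4.19: the kernel provides vmalloc_to_page() */
#define my_vmalloc_to_page(addr) vmalloc_to_page(addr)
#else
static struct page *my_vmalloc_to_page(void *vmalloc_addr)
{
        unsigned long addr = (unsigned long) vmalloc_addr;
        struct page *page = NULL;
        pgd_t *pgd = pgd_offset_k(addr);
        pmd_t *pmd;
        pte_t *ptep, pte;

        if (!pgd_none(*pgd)) {
                pmd = pmd_offset(pgd, addr);
                if (!pmd_none(*pmd)) {
#ifdef pte_offset_atomic
                        /* 2.4.18-SuSE: new pte-highmem methods */
                        ptep = pte_offset_atomic(pmd, addr);
                        pte = *ptep;
                        pte_kunmap(ptep);
#else
                        /* plain <= 2.4.18: old pte code */
                        ptep = pte_offset(pmd, addr);
                        pte = *ptep;
#endif
                        if (pte_present(pte))
                                page = pte_page(pte);
                }
        }
        return page;
}
#endif
```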

Drivers walking pagetables for purposes other than finding the vmalloc() physical pages

They too can be adapted in the short term using pte_offset_atomic()/pte_kunmap(), just as a <= 2.4.18 kernel deals with the vmalloc areas (as outlined in the previous section).

However, in the long term those drivers would be better served by other common code APIs provided by the kernel, like map_user_kiobuf() or get_user_pages() (the latter is not yet exported to modules, but it should be fine to export it too if necessary). Then they would no longer depend on the low-level details of the memory management, and in turn they wouldn't break so easily in the long term.
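As a sketch of the get_user_pages() route (mydrv_pin_user_buffer is a hypothetical name; the 2.4/2.5 prototype get_user_pages(tsk, mm, start, len, write, force, pages, vmas) is assumed, and as noted above the symbol may not yet be exported to modules):

```c
/* Sketch: pin npages of a user buffer with get_user_pages() instead
 * of walking pagetables by hand.  Returns the number of pages
 * actually pinned, or a negative error. */
static int mydrv_pin_user_buffer(unsigned long uaddr, int npages,
                                 struct page **pages)
{
        int got;

        down_read(&current->mm->mmap_sem);
        got = get_user_pages(current, current->mm, uaddr, npages,
                             1 /* write */, 0 /* don't force */,
                             pages, NULL);
        up_read(&current->mm->mmap_sem);

        return got;
}
```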

vmalloc_to_page() and EXPORT_SYMBOL_GPL()

Originally vmalloc_to_page() was exported to modules using EXPORT_SYMBOL_GPL (this means only GPL modules could use it).

That happened for no good reason and it will be fixed. It doesn't make sense to require non-GPL modules to walk pagetables by hand. A discussion covering this topic extensively can be found in the l-k mailing list archives of April 2002.

Conclusions

No driver should touch pagetables by hand. Drivers for the new >=2.4.19 and 2.5.x kernels will use vmalloc_to_page()/map_user_kiobuf() if necessary. vmalloc_to_page() handles the highmem pagetables internally, so pagetable handling becomes completely transparent to the device drivers.

Kernels <= 2.4.18 shipping with the pte-highmem feature (like the 2.4.18-SuSE kernel in SuSE Linux 8.0) don't yet provide the universal vmalloc_to_page() functionality, so any out-of-tree driver that needs to walk pagetables by hand will need a few-line patch (pte_offset_atomic/pte_kunmap) to handle the highmem pagetables correctly.

8 April 2002 - Andrea Arcangeli - SuSE

diff -u ../drm.orig/drm_scatter.h ./drm_scatter.h
--- ../drm.orig/drm_scatter.h   Mon Feb  4 12:40:38 2002
+++ ./drm_scatter.h     Mon Feb  4 14:58:04 2002
@@ -68,7 +68,7 @@
        unsigned long pages, i, j;
        pgd_t *pgd;
        pmd_t *pmd;
-       pte_t *pte;
+       pte_t *ptep, pte;
 
        DRM_DEBUG( "%s\n", __FUNCTION__ );
 
@@ -143,11 +143,13 @@
                if ( !pmd_present( *pmd ) )
                        goto failed;
 
-               pte = pte_offset( pmd, i );
-               if ( !pte_present( *pte ) )
+               ptep = pte_offset_atomic( pmd, i );
+                pte = *ptep;
+                pte_kunmap(ptep);
+               if ( !pte_present( pte ) )
                        goto failed;
 
-               entry->pagelist[j] = pte_page( *pte );
+               entry->pagelist[j] = pte_page( pte );
 
                SetPageReserved( entry->pagelist[j] );
        }
diff -u ../drm.orig/drm_vm.h ./drm_vm.h
--- ../drm.orig/drm_vm.h        Mon Feb  4 12:40:38 2002
+++ ./drm_vm.h  Mon Feb  4 14:58:56 2002
@@ -132,7 +132,7 @@
        unsigned long    i;
        pgd_t            *pgd;
        pmd_t            *pmd;
-       pte_t            *pte;
+       pte_t            *ptep, pte;
        struct page      *page;
 
        if (address > vma->vm_end) return NOPAGE_SIGBUS; /* Disallow mremap */
@@ -147,10 +147,12 @@
        if( !pgd_present( *pgd ) ) return NOPAGE_OOM;
        pmd = pmd_offset( pgd, i );
        if( !pmd_present( *pmd ) ) return NOPAGE_OOM;
-       pte = pte_offset( pmd, i );
-       if( !pte_present( *pte ) ) return NOPAGE_OOM;
+       ptep = pte_offset_atomic( pmd, i );
+        pte = *ptep;
+        pte_kunmap(ptep);
+       if( !pte_present( pte ) ) return NOPAGE_OOM;
 
-       page = pte_page(*pte);
+       page = pte_page(pte);
        get_page(page);
 
        DRM_DEBUG("0x%08lx => 0x%08x\n", address, page_to_bus(page));
diff -u ../kernel.orig/i810_dma.c ./i810_dma.c
--- ../kernel.orig/i810_dma.c   Fri May  3 14:16:32 2002
+++ ./i810_dma.c        Fri May  3 16:07:24 2002
@@ -75,6 +75,10 @@
        outring &= ringmask;                                            \
 } while (0);
 
+#if defined(put_page) && defined(UnlockPage)
+#define AGPGART_2_4_19
+#endif /* put_page && UnlockPage */
+
 static inline void i810_print_status_page(drm_device_t *dev)
 {
        drm_device_dma_t *dma = dev->dma;
@@ -290,12 +294,21 @@
 
 static void i810_free_page(drm_device_t *dev, unsigned long page)
 {
+#ifdef AGPGART_2_4_19
+    struct page *p;
+#endif /* AGPGART_2_4_19 */
        if(page == 0UL)
                return;
 
+#ifdef AGPGART_2_4_19
+    p = virt_to_page(page);
+    put_page(p);
+    UnlockPage(p);
+#else /* AGPGART_2_4_19 */
        atomic_dec(&virt_to_page(page)->count);
        clear_bit(PG_locked, &virt_to_page(page)->flags);
        wake_up(&virt_to_page(page)->wait);
+#endif /* AGPGART_2_4_19 */
        free_page(page);
        return;
 }
diff -u ../kernel.orig/i830_dma.c ./i830_dma.c
--- ../kernel.orig/i830_dma.c   Fri May  3 14:16:32 2002
+++ ./i830_dma.c        Fri May  3 16:08:25 2002
@@ -93,6 +93,10 @@
        outring &= ringmask;                                            \
 } while (0);
 
+#if defined(put_page) && defined(UnlockPage)
+#define AGPGART_2_4_19
+#endif /* put_page && UnlockPage */
+
 static inline void i830_print_status_page(drm_device_t *dev)
 {
        drm_device_dma_t *dma = dev->dma;
@@ -312,12 +316,21 @@
 
 static void i830_free_page(drm_device_t *dev, unsigned long page)
 {
+#ifdef AGPGART_2_4_19
+    struct page *p;
+#endif /* AGPGART_2_4_19 */
        if(page == 0UL) 
                return;
        
+#ifdef AGPGART_2_4_19
+    p = virt_to_page(page);
+    put_page(p);
+    UnlockPage(p);
+#else /* AGPGART_2_4_19 */
        atomic_dec(&virt_to_page(page)->count);
        clear_bit(PG_locked, &virt_to_page(page)->flags);
        wake_up(&virt_to_page(page)->wait);
+#endif /* AGPGART_2_4_19 */
        free_page(page);
        return;
 }
diff -u ../drm.old/drm_scatter.h ./drm_scatter.h
--- ../drm.old/drm_scatter.h    Sat May  4 19:57:52 2002
+++ ./drm_scatter.h     Sat May  4 19:59:19 2002
@@ -32,6 +32,8 @@
 #include <linux/vmalloc.h>
 #include "drmP.h"
 
+#include <linux/highmem.h>
+
 #define DEBUG_SCATTER 0
 
 void DRM(sg_cleanup)( drm_sg_mem_t *entry )
diff -u ../drm.old/drm_vm.h ./drm_vm.h
--- ../drm.old/drm_vm.h Sat May  4 19:57:52 2002
+++ ./drm_vm.h  Sat May  4 20:00:12 2002
@@ -32,6 +32,8 @@
 #define __NO_VERSION__
 #include "drmP.h"
 
+#include <linux/highmem.h>
+
 struct vm_operations_struct   DRM(vm_ops) = {
        nopage:  DRM(vm_nopage),
        open:    DRM(vm_open),
