On Sat, Apr 27, 2013 at 03:00:11PM +0800, Will Huck wrote:
> On 04/26/2013 11:35 PM, Frantisek Hrbata wrote:
> >On Fri, Apr 26, 2013 at 01:21:28PM +0800, Will Huck wrote:
> >>Hi Peter,
> >>On 04/02/2013 08:28 PM, Frantisek Hrbata wrote:
> >>>When CR4.PAE is set, the 64b PTE's are used(ARCH_PHYS_ADDR_T_64BIT is set for
> >>>X86_64 || X86_PAE). According to [1] Chapter 4 Paging, some higher bits in 64b
> >>>PTE are reserved and have to be set to zero. For example, for IA-32e and 4KB
> >>>page [1] 4.5 IA-32e Paging: Table 4-19, bits 51-M(MAXPHYADDR) are reserved. So
> >>>for a CPU with e.g. 48bit phys addr width, bits 51-48 have to be zero. If one of
> >>>the reserved bits is set, [1] 4.7 Page-Fault Exceptions, the #PF is generated
> >>>with RSVD error code.
> >>>
> >>><quote>
> >>>RSVD flag (bit 3).
> >>>This flag is 1 if there is no valid translation for the linear address because a
> >>>reserved bit was set in one of the paging-structure entries used to translate
> >>>that address. (Because reserved bits are not checked in a paging-structure entry
> >>>whose P flag is 0, bit 3 of the error code can be set only if bit 0 is also
> >>>set.)
> >>></quote>
> >>>
> >>>In mmap_mem() the first check is valid_mmap_phys_addr_range(), but it always
> >>>returns 1 on x86. So it's possible to use any pgoff we want and to set the PTE's
> >>>reserved bits in remap_pfn_range(). Meaning there is a possibility to use mmap
> >>In this case, remap_pfn_range() sets up the mapping and the reserved bits
> >>for mmio memory, so the mmio memory is already populated. Why is a #PF
> >>triggered?
> >Hi,
> >
> >I think this is described in the quote above for the RSVD flag.
> >
> >remap_pfn_range() => page present => touch page => tlb miss =>
> >walk through paging structures => reserved bit set => #pf with rsvd flag
> 
> A present page can also trigger #PF? Why?

Yes, please see 
Intel 64 and IA-32 Architectures Software Developer's Manual, Volume 3A

4.7 PAGE-FAULT EXCEPTIONS
<quote>
- RSVD flag (bit 3).
This flag is 1 if there is no valid translation for the linear address because
a reserved bit was set in one of the paging-structure entries used to
translate that address. (Because reserved bits are not checked in a
paging-structure entry whose P flag is 0, bit 3 of the error code can be set
only if bit 0 is also set.) Bits reserved in the paging-structure entries are
reserved for future functionality. Software developers should be aware that
such bits may be used in the future and that a paging-structure entry that
causes a page-fault exception on one processor might not do so in the future.
</quote>
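
For what it's worth, the error code can be decoded by hand. Below is a
minimal sketch that assumes only the error code bit layout from section 4.7
of the SDM. The PF_* names and pf_describe() are made up for this
illustration, they are not the kernel's own definitions.

```c
#include <assert.h>
#include <stdio.h>

/* Page-fault error code bits (SDM Vol. 3A, section 4.7). Names are illustrative. */
#define PF_P    (1u << 0)  /* 1: page was present, 0: page not present */
#define PF_WR   (1u << 1)  /* 1: write access, 0: read access */
#define PF_US   (1u << 2)  /* 1: user-mode access, 0: supervisor-mode access */
#define PF_RSVD (1u << 3)  /* 1: reserved bit set in a paging-structure entry */
#define PF_ID   (1u << 4)  /* 1: instruction fetch */

/* Hypothetical helper: write a short description of error code ec into buf. */
const char *pf_describe(unsigned int ec, char *buf, size_t len)
{
        assert(buf != NULL && len > 0);
        snprintf(buf, len, "%s %s %s%s%s",
                 (ec & PF_P) ? "present" : "not-present",
                 (ec & PF_WR) ? "write" : "read",
                 (ec & PF_US) ? "user" : "supervisor",
                 (ec & PF_RSVD) ? " rsvd" : "",
                 (ec & PF_ID) ? " ifetch" : "");
        return buf;
}
```

For the 000d reported in the "Bad pagetable" line of the oops below this
gives "present read user rsvd", i.e. both bit 0 and bit 3 are set, as the
quote says they must be.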

I cannot tell you why. I guess this is more a question for some Intel guys.

Anyway, this patch tries to fix the following problem, which ends in
the "Bad pagetable" oops shown below.

---------------------------------8<--------------------------------------
#include <stdio.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <err.h>
#include <stdlib.h>
#include <sys/mman.h>

#define die(fmt, ...) err(1, fmt, ##__VA_ARGS__)

/*
   1) Find some non system ram in case the CONFIG_STRICT_DEVMEM is defined
   $ cat /proc/iomem | grep -v "\(System RAM\|reserved\)"

   2) Find physical address width
   $ cat /proc/cpuinfo | grep "address sizes"

   PTE bits 51 - M are reserved, where M is the physical address width found
   in 2)
   Note: step 2) is actually not needed, we can always set just bit 51
   (0x8000000000000)

   Set OFFSET macro to

   (start of iomem range found in 1)) | (1ULL << 51)

   for example
   0x000a0000 | 0x8000000000000 = 0x80000000a0000

   where 0x000a0000 is start of PCI BUS on my laptop

 */

#define OFFSET 0x80000000a0000LL

int main(int argc, char *argv[])
{
        int fd;
        long ps;
        long pgoff;
        char *map;
        char c;

        ps = sysconf(_SC_PAGE_SIZE);
        if (ps == -1)
                die("cannot get page size");

        fd = open("/dev/mem", O_RDONLY);
        if (fd == -1)
                die("cannot open /dev/mem");

        /* Round OFFSET up to a page boundary; mmap() needs an aligned offset. */
        pgoff = (OFFSET + (ps - 1)) & ~(ps - 1);
        printf("%lx\n", pgoff);

        map = mmap(NULL, ps, PROT_READ, MAP_SHARED, fd, pgoff);
        if (map == MAP_FAILED)
                die("cannot mmap");

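        /* Touch the page; the volatile read keeps the compiler from dropping it. */
        c = *(volatile char *)map;
        (void)c;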

        if (munmap(map, ps) == -1)
                die("cannot munmap");

        if (close(fd) == -1)
                die("cannot close");

        return 0;
}
---------------------------------8<--------------------------------------
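
The OFFSET arithmetic from the comment in the reproducer can be sketched
as a small helper. This is only an illustration: rsvd_offset() is a
hypothetical name, and the iomem start and page size are inputs, nothing
is detected from the running system. Note that in C the shift needs a
64-bit operand, because shifting a plain int 1 by 51 is undefined.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: take the start of an iomem range, set bit 51 on
 * top of it and round the result up to a page boundary.
 */
uint64_t rsvd_offset(uint64_t iomem_start, uint64_t page_size)
{
        uint64_t off;

        /* page_size has to be a non-zero power of two for the mask below */
        assert(page_size != 0 && (page_size & (page_size - 1)) == 0);

        off = iomem_start | (UINT64_C(1) << 51);
        return (off + (page_size - 1)) & ~(page_size - 1);
}
```

With iomem_start 0x000a0000 and a 4096 byte page this returns
0x80000000a0000, the same value the OFFSET macro uses.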

Apr 27 19:52:29 dhcp-26-164 kernel: [ 6464.814860] pfrsvd: Corrupted page table at address 7f34087c8000
Apr 27 19:52:29 dhcp-26-164 kernel: [ 6464.817356] PGD 12d0b3067 PUD 12d544067 PMD 12e29d067 PTE 80080000000a0225
Apr 27 19:52:29 dhcp-26-164 kernel: [ 6464.820216] Bad pagetable: 000d [#1] SMP
Apr 27 19:52:29 dhcp-26-164 kernel: [ 6464.822821] Modules linked in: fuse 
ebtable_nat xt_CHECKSUM bridge stp llc ipt_MASQUERADE nf_conntrack_netbios_ns 
nf_conntrack_broadcast ip6table_mangle ip6t_REJECT nf_conntrack_ipv6 
nf_defrag_ipv6 iptable_nat nf_nat_ipv4 nf_nat iptable_mangle nf_conntrack_ipv4 
nf_defrag_ipv4 xt_conntrack nf_conntrack ebtable_filter ebtables 
ip6table_filter ip6_tables be2iscsi iscsi_boot_sysfs bnx2i cnic uio cxgb4i 
cxgb4 cxgb3i cxgb3 mdio libcxgbi ib_iser rdma_cm ib_addr iw_cm ib_cm ib_sa 
ib_mad ib_core iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi rfcomm bnep 
arc4 iwldvm mac80211 snd_hda_codec_hdmi snd_hda_codec_conexant snd_hda_intel 
snd_hda_codec uvcvideo snd_hwdep snd_seq snd_seq_device snd_pcm iTCO_wdt 
videobuf2_vmalloc videobuf2_memops videobuf2_core videodev btusb snd_page_alloc 
bluetooth snd_timer thinkpad_acpi iwlwifi media snd i2c_i801 cfg80211 
iTCO_vendor_support intel_ips e1000e coretemp lpc_ich mfd_core soundcore rfkill 
mei microcode nfsd auth_rpcgss nfs_acl lockd sunrpc vhost_net tun macvtap 
macvlan kvm_intel kvm binfmt_misc uinput dm_crypt crc32c_intel i915 
ghash_clmulni_intel firewire_ohci i2c_algo_bit drm_kms_helper firewire_core 
sdhci_pci crc_itu_t drm sdhci mmc_core i2c_core mxm_wmi video wmi
Apr 27 19:52:29 dhcp-26-164 kernel: [ 6464.845686] CPU 3
Apr 27 19:52:29 dhcp-26-164 kernel: [ 6464.845709] Pid: 8751, comm: pfrsvd Not 
tainted 3.8.1-201.fc18.x86_64 #1 LENOVO 4384AV1/4384AV1
Apr 27 19:52:29 dhcp-26-164 kernel: [ 6464.852876] RIP: 
0033:[<00000000004007db>]  [<00000000004007db>] 0x4007da
Apr 27 19:52:29 dhcp-26-164 kernel: [ 6464.856587] RSP: 002b:00007ffff5c12620  
EFLAGS: 00010213
Apr 27 19:52:29 dhcp-26-164 kernel: [ 6464.860296] RAX: 00007f34087c8000 RBX: 
0000000000000000 RCX: 00000030fd4eed6a
Apr 27 19:52:29 dhcp-26-164 kernel: [ 6464.864061] RDX: 0000000000000001 RSI: 
0000000000001000 RDI: 0000000000000000
Apr 27 19:52:29 dhcp-26-164 kernel: [ 6464.867878] RBP: 00007ffff5c12660 R08: 
0000000000000003 R09: 00080000000a0000
Apr 27 19:52:29 dhcp-26-164 kernel: [ 6464.871706] R10: 0000000000000001 R11: 
0000000000000206 R12: 00000000004005f0
Apr 27 19:52:29 dhcp-26-164 kernel: [ 6464.875566] R13: 00007ffff5c12740 R14: 
0000000000000000 R15: 0000000000000000
Apr 27 19:52:29 dhcp-26-164 kernel: [ 6464.879490] FS:  00007f34087a0740(0000) 
GS:ffff880137d80000(0000) knlGS:0000000000000000
Apr 27 19:52:29 dhcp-26-164 kernel: [ 6464.883447] CS:  0010 DS: 0000 ES: 0000 
CR0: 0000000080050033
Apr 27 19:52:29 dhcp-26-164 kernel: [ 6464.887436] CR2: 00007f34087c8000 CR3: 
0000000107509000 CR4: 00000000000007e0
Apr 27 19:52:29 dhcp-26-164 kernel: [ 6464.891495] DR0: 0000000000000000 DR1: 
0000000000000000 DR2: 0000000000000000
Apr 27 19:52:29 dhcp-26-164 kernel: [ 6464.895603] DR3: 0000000000000000 DR6: 
00000000ffff0ff0 DR7: 0000000000000400
Apr 27 19:52:29 dhcp-26-164 kernel: [ 6464.899739] Process pfrsvd (pid: 8751, 
threadinfo ffff880104ea8000, task ffff88012d9e1760)
Apr 27 19:52:29 dhcp-26-164 kernel: [ 6464.903944]
Apr 27 19:52:29 dhcp-26-164 kernel: [ 6464.908169] RIP  [<00000000004007db>] 
0x4007da
Apr 27 19:52:29 dhcp-26-164 kernel: [ 6464.912447]  RSP <00007ffff5c12620>
Apr 27 19:52:29 dhcp-26-164 kernel: [ 6464.943802] ---[ end trace 
1113d12a53145197 ]---

Please note the PTE value 80080000000a0225: bit 51, which is reserved, is set.
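
To check a PTE value like this one by hand, here is a minimal sketch that
builds the mask of reserved bits 51..M and tests a 4KB PTE against it.
The function names are made up for this illustration, and M (MAXPHYADDR)
is passed in as a parameter; the 48 used in the note after the code is
just the example width from the patch description, not a value read from
this machine.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Mask of PTE bits 51..M, which are reserved when the physical address
 * width is M bits. M is an input here, it is not detected.
 */
uint64_t pte_rsvd_mask(unsigned int maxphyaddr)
{
        assert(maxphyaddr >= 32 && maxphyaddr <= 52);
        return ((UINT64_C(1) << 52) - 1) & ~((UINT64_C(1) << maxphyaddr) - 1);
}

/*
 * Non-zero if the PTE is present (bit 0) and has one of the reserved
 * bits set. Reserved bits are not checked when the P flag is 0.
 */
int pte_has_rsvd_bits(uint64_t pte, unsigned int maxphyaddr)
{
        return (pte & 1) && (pte & pte_rsvd_mask(maxphyaddr)) != 0;
}
```

With M = 48 the mask is 0x000f000000000000, and 0x80080000000a0225 has
bit 0 set and bit 51 inside that mask, so the check returns non-zero.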

HTH

Thank you
> 
> >
> >I hope I didn't misunderstand your question.
> >
> >Thanks
> >
> >>>on /dev/mem and cause system panic. It's probably not that serious, because
> >>>access to /dev/mem is limited and the system has to have panic_on_oops set, but
> >>>still I think we should check this and return error.
> >>>
> >>>This patch adds check for x86 when ARCH_PHYS_ADDR_T_64BIT is set, the same way
> >>>as it is already done in e.g. ioremap. With this fix mmap returns -EINVAL if the
> >>>requested phys addr is bigger than the supported phys addr width.
> >>>
> >>>[1] Intel 64 and IA-32 Architectures Software Developer's Manual, Volume 3A
> >>>
> >>>Signed-off-by: Frantisek Hrbata <fhrb...@redhat.com>
> >>>---
> >>>  arch/x86/include/asm/io.h |  4 ++++
> >>>  arch/x86/mm/mmap.c        | 13 +++++++++++++
> >>>  2 files changed, 17 insertions(+)
> >>>
> >>>diff --git a/arch/x86/include/asm/io.h b/arch/x86/include/asm/io.h
> >>>index d8e8eef..39607c6 100644
> >>>--- a/arch/x86/include/asm/io.h
> >>>+++ b/arch/x86/include/asm/io.h
> >>>@@ -242,6 +242,10 @@ static inline void flush_write_buffers(void)
> >>>  #endif
> >>>  }
> >>>+#define ARCH_HAS_VALID_PHYS_ADDR_RANGE
> >>>+extern int valid_phys_addr_range(phys_addr_t addr, size_t count);
> >>>+extern int valid_mmap_phys_addr_range(unsigned long pfn, size_t count);
> >>>+
> >>>  #endif /* __KERNEL__ */
> >>>  extern void native_io_delay(void);
> >>>diff --git a/arch/x86/mm/mmap.c b/arch/x86/mm/mmap.c
> >>>index 845df68..92ec31c 100644
> >>>--- a/arch/x86/mm/mmap.c
> >>>+++ b/arch/x86/mm/mmap.c
> >>>@@ -31,6 +31,8 @@
> >>>  #include <linux/sched.h>
> >>>  #include <asm/elf.h>
> >>>+#include "physaddr.h"
> >>>+
> >>>  struct __read_mostly va_alignment va_align = {
> >>>   .flags = -1,
> >>>  };
> >>>@@ -122,3 +124,14 @@ void arch_pick_mmap_layout(struct mm_struct *mm)
> >>>           mm->unmap_area = arch_unmap_area_topdown;
> >>>   }
> >>>  }
> >>>+
> >>>+int valid_phys_addr_range(phys_addr_t addr, size_t count)
> >>>+{
> >>>+  return addr + count <= __pa(high_memory);
> >>>+}
> >>>+
> >>>+int valid_mmap_phys_addr_range(unsigned long pfn, size_t count)
> >>>+{
> >>>+  resource_size_t addr = (pfn << PAGE_SHIFT) + count;
> >>>+  return phys_addr_valid(addr);
> >>>+}
> >>--
> >>To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> >>the body of a message to majord...@vger.kernel.org
> >>More majordomo info at  http://vger.kernel.org/majordomo-info.html
> >>Please read the FAQ at  http://www.tux.org/lkml/
> 

-- 
Frantisek Hrbata