Re: 32-bit memory limits IN DETAIL (Was: perspectives on 32 bit vs 64 bit)
Martin Kuball wrote:
> On Tuesday, 25 October 2005 02:31, [EMAIL PROTECTED] wrote:
> [snip]
>> Because the kernel address space has to hold more than just RAM (in
>> particular, it also has to hold memory-mapped PCI devices like video
>> cards), if you have 1G of physical memory, the kernel will by default
>> only use 896M of it, leaving 128M of kernel address space for PCI
>> devices. A different user/kernel split can help there. I use
>> 2.75/1.25G on 1G RAM machines, but if you use PAE or NX, the split
>> has to be on a 1G boundary. But these are all workarounds. The real
>> solution is to use a larger virtual address space so that the
>> original, efficient technique of mapping both the user's virtual
>> address space and the kernel's address space (basically a copy of
>> physical memory) will both fit.
>
> And what about 64bit systems? How is the splitting done there? Do I
> have to worry?

The problem is exactly the same, but on a larger scale. For 32-bit
processors, you get into trouble when your programs need close to 2^32
bytes (i.e. 4 GB) or more. For a true 64-bit processor, you get the
same trouble the day you need close to 2^64 bytes or more per process.
Nobody is anywhere near this limit yet.

Helge Hafting
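To make the per-process limit concrete, here is a minimal user-space sketch, assuming nothing beyond mmap(2); the exact figure depends on the kernel's user/kernel split, ulimits and whatever is already mapped. It binary-searches for the largest single anonymous reservation, which comes out a little under 3G on a default 32-bit kernel and vastly larger on a 64-bit one.

/* vas_probe.c -- rough probe of how much contiguous virtual address space
 * one process can reserve.  A sketch only: the exact number varies with
 * the kernel's user/kernel split, existing mappings and ulimits.
 * Build: cc -O2 vas_probe.c -o vas_probe
 */
#define _GNU_SOURCE
#include <stdio.h>
#include <sys/mman.h>

int main(void)
{
    size_t lo = 0, hi = (size_t)-1;            /* binary-search the mapping size */
    while (hi - lo > (1UL << 20)) {            /* stop once within 1 MiB */
        size_t mid = lo + (hi - lo) / 2;
        void *p = mmap(NULL, mid, PROT_NONE,
                       MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);
        if (p == MAP_FAILED) {
            hi = mid;                          /* too big: shrink */
        } else {
            munmap(p, mid);
            lo = mid;                          /* fits: try bigger */
        }
    }
    printf("pointer size               : %zu bits\n", sizeof(void *) * 8);
    printf("largest single reservation : %zu MiB\n", lo >> 20);
    return 0;
}

On a 3G/1G kernel the answer is bounded by the 3G of user linear address space; on a 64-bit machine it is bounded by however much of the 64-bit space the architecture actually implements for user mappings.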
Re: 32-bit memory limits IN DETAIL (Was: perspectives on 32 bit vs 64 bit)
On Thursday, 3 November 2005, 15:42:30 CET, Helge Hafting wrote:
[...]
> Nobody is anywhere near this limit yet.
> Helge Hafting

Sure, 640kB^W 16EB ought to be enough. ;o)

-- 
Sylvain Sauvage
Re: 32-bit memory limits IN DETAIL (Was: perspectives on 32 bit vs 64 bit)
Hi,

I'm now worried as I nearly understand this!! I need to play more guitar!! Many thanks for a well-written mail.

cheers
Bob

[EMAIL PROTECTED] wrote:
> This seems to come up every now and then, so let me explain.
[snip]
Re: 32-bit memory limits IN DETAIL (Was: perspectives on 32 bit vs 64 bit)
On Tuesday, 25 October 2005 02:31, [EMAIL PROTECTED] wrote:
[snip]
> Because the kernel address space has to hold more than just RAM (in
> particular, it also has to hold memory-mapped PCI devices like video
> cards), if you have 1G of physical memory, the kernel will by default
> only use 896M of it, leaving 128M of kernel address space for PCI
> devices. A different user/kernel split can help there. I use
> 2.75/1.25G on 1G RAM machines, but if you use PAE or NX, the split
> has to be on a 1G boundary. But these are all workarounds. The real
> solution is to use a larger virtual address space so that the
> original, efficient technique of mapping both the user's virtual
> address space and the kernel's address space (basically a copy of
> physical memory) will both fit.

And what about 64bit systems? How is the splitting done there? Do I
have to worry?

Martin
32-bit memory limits IN DETAIL (Was: perspectives on 32 bit vs 64 bit)
This seems to come up every now and then, so let me explain. None of this is new information, but it can be a bit confusing.

First, i386 memory addressing. The i386 is unlike all other processors in that there are two levels of address translation that take place. First, we have a 16-bit segment + 32-bit offset VIRTUAL address. Now, 3 bits of that segment are sort of taken (2 bits of RPL and 1 local/global bit), so you really only get 8192 segments per process. This VIRTUAL address is then translated into a 32-bit LINEAR address by checking the offset against the segment limit and adding the segment base. Then this 32-bit LINEAR address is fed to a standard page-based MMU, producing a 32- or 36-bit PHYSICAL address.

Most processors go VIRTUAL --page tables-- PHYSICAL. The i386 goes VIRTUAL --segments-- LINEAR --page tables-- PHYSICAL.

The bottleneck is the 32-bit LINEAR address space. A process can have at most 2^32 bytes addressable at any one time without the operating system rewriting the page tables.

Note first of all that, if you actively use more than one segment at a time (such as for code, stack and data), this limits your maximum segment size to less than 2^32 bytes each, since the TOTAL of the simultaneously accessible segments has to fit within 2^32 bytes. So, for example, if you had two segments of 4G, you could not have them both resident at the same time, and so you could not get a MOV instruction from one to the other to complete. (And the MOV instruction itself would have to go somewhere.) Thus, you cannot actually reach the 2^45-byte addressing limit that up to 2^13 segments of up to 2^32 bytes each implies.

Secondly, even if you do demand segmentation, bringing segments into and out of the 32-bit LINEAR address space, this still requires that the operating system rewrite the page tables (and invalidate the TLB entries) in response to segment faults in order to access the relevant bits of PHYSICAL memory. This is exactly the SAME operating system and hardware overhead as using mmap or mremap to remap bits of a linear address space. The only difference would be if it were much easier for the user program to deal with segments than to deal with explicit dynamic mmaps. And it's not at all clear that it is.

For these reasons, 32-bit x86 operating systems tend to ignore the segmentation features and just use paging. It just isn't worth the complexity, and for multi-platform operating systems like Linux, it isn't worth the portability hassles. In fact, this has in turn led to x86 designers de-emphasizing segment register loading speed, so large-model programs that use multiple segments take a significant speed hit.
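To make the two stages concrete, here is a small self-contained C sketch of the VIRTUAL -> LINEAR -> PHYSICAL path. It is only a toy model: the segment descriptor and the page-table lookup are invented for illustration, and privilege levels, protection bits, PAE and the TLB are all ignored.

/* toy_i386_translate.c -- a deliberately simplified model of the two-stage
 * i386 translation (VIRTUAL -> LINEAR -> PHYSICAL).  The segment values and
 * the "page table" below are made up; this is illustration, not hardware.
 * Build: cc -O2 toy_i386_translate.c -o toy_i386_translate
 */
#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>

struct segment { uint32_t base; uint32_t limit; };   /* from a GDT/LDT descriptor */

/* Stage 1: segmentation.  selector:offset (VIRTUAL) -> 32-bit LINEAR. */
static uint32_t to_linear(const struct segment *seg, uint32_t offset)
{
    if (offset > seg->limit) {                 /* a general-protection fault on real hardware */
        fprintf(stderr, "segment limit exceeded\n");
        exit(1);
    }
    return seg->base + offset;                 /* wraps mod 2^32, as on the CPU */
}

/* Stage 2: paging.  32-bit LINEAR -> PHYSICAL via a two-level walk:
 * bits 31-22 index the page directory, 21-12 the page table, 11-0 the page. */
static uint64_t to_physical(uint64_t (*lookup)(uint32_t dir, uint32_t table),
                            uint32_t linear)
{
    uint32_t dir    = linear >> 22;
    uint32_t table  = (linear >> 12) & 0x3ff;
    uint32_t offset = linear & 0xfff;
    return lookup(dir, table) + offset;        /* frame base + offset within the page */
}

/* A made-up "page table": map every page to some frame above 4G to show
 * that the PHYSICAL result can be wider than 32 bits.                  */
static uint64_t demo_lookup(uint32_t dir, uint32_t table)
{
    return ((uint64_t)0x100000 + ((uint64_t)dir << 10) + table) << 12;
}

int main(void)
{
    struct segment data = { .base = 0x10000000, .limit = 0x0fffffff };
    uint32_t linear = to_linear(&data, 0x1234);
    printf("linear   : 0x%08x\n", linear);
    printf("physical : 0x%09llx\n",
           (unsigned long long)to_physical(demo_lookup, linear));
    return 0;
}

The point is that each stage is just a bounds check, an add and a couple of table lookups; the interesting part is that the value in the middle (the LINEAR address) can never be wider than 32 bits.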
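And here is what "explicit dynamic mmaps" look like in practice: keep one window of a large object mapped and slide it as needed. A sketch only; the file name "bigfile" and the 256M window are arbitrary, and error handling is minimal.

/* window.c -- slide a fixed-size mapping window across a file that may be
 * far larger than the 32-bit address space (needs large-file support on
 * 32-bit, hence _FILE_OFFSET_BITS).
 * Build: cc -O2 window.c -o window
 */
#define _GNU_SOURCE
#define _FILE_OFFSET_BITS 64
#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/mman.h>

#define WINDOW (256UL << 20)          /* 256 MB mapped at any one time */

int main(void)
{
    int fd = open("bigfile", O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    off_t size = lseek(fd, 0, SEEK_END);
    unsigned long sum = 0;

    for (off_t pos = 0; pos < size; pos += WINDOW) {
        size_t len = (size - pos < (off_t)WINDOW) ? (size_t)(size - pos) : WINDOW;
        /* Map only the current window; the kernel rewrites the page tables
         * here, just as it would have to on a segment fault. */
        unsigned char *p = mmap(NULL, len, PROT_READ, MAP_PRIVATE, fd, pos);
        if (p == MAP_FAILED) { perror("mmap"); return 1; }

        for (size_t i = 0; i < len; i++)
            sum += p[i];              /* touch the data through the window */

        munmap(p, len);               /* slide the window */
    }
    printf("checksum: %lu\n", sum);
    close(fd);
    return 0;
}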
Now, for why the Linux kernel takes 1 GB of virtual address space...

Every time a user-space program does a read() or write() call, or makes any similar system call that moves a buffer of data, the kernel has to copy between the user buffers and its own private file cache. For this to be possible, the source and destination buffers must be in the same VIRTUAL address space. And for it to be remotely efficient, they have to be in the same LINEAR address space as well.

Now, it is possible to have a separate kernel address space, and demand-map user-space buffers into it to do the copying. That's what the 4G+4G patches do. But that means that on EVERY system call, you have to change the page tables around, which results in flushing the TLB and a lot of overhead.

The default Linux config arranges for the kernel's address space and the user's address space to both be present at the same time. Page table entries have a permission bit that lets them be inaccessible to user mode but accessible from kernel mode without having to reload the TLB. This is very fast. But it results in the classic split between 3G of user address space and 1G of kernel address space. It could be done in different ways, but *any alternative would be much slower* for typical programs that don't need more than 3G of address space.

The thing that's causing a real problem is that common physical memory sizes are approaching the 4G address space. Thus, it's no longer guaranteed that the 1G of kernel space is big enough to hold all of physical memory, so kernel access to some parts of it has to be bank-switched (the CONFIG_HIGHMEM options). By careful design, this has been kept reasonably fast, but there is overhead.

Because the kernel address space has to hold more than just RAM (in particular, it also has to hold memory-mapped PCI devices like video cards), if you have 1G of physical memory, the kernel will by default only use 896M of it, leaving 128M of kernel address space for PCI devices. A different user/kernel split can help there. I use 2.75/1.25G on 1G RAM machines, but if you use PAE or NX, the split has to be on a 1G boundary.

But these are all workarounds. The real solution is to use a larger virtual address space so that the original, efficient technique of mapping both the user's virtual address space and the kernel's address space (basically a copy of physical memory) will both fit.
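Where does the 896M figure come from? It is just the default constants of the 3G/1G layout: a 1G kernel window minus a 128M reserve for vmalloc, ioremap and the PCI mappings mentioned above. The sketch below only does that arithmetic; PAGE_OFFSET = 0xC0000000 and the 128M reserve are the usual 32-bit x86 defaults, everything else is illustration.

/* lowmem_math.c -- back-of-the-envelope version of the 896 MB figure,
 * using the constants of a default 32-bit x86 kernel with the 3G/1G split.
 * Build: cc -O2 lowmem_math.c -o lowmem_math
 */
#include <stdio.h>
#include <stdint.h>

int main(void)
{
    const uint64_t PAGE_OFFSET     = 0xC0000000;        /* user/kernel split: 3 GB */
    const uint64_t ADDRESS_SPACE   = 1ULL << 32;        /* 4 GB of linear addresses */
    const uint64_t KERNEL_WINDOW   = ADDRESS_SPACE - PAGE_OFFSET;   /* 1 GB */
    const uint64_t VMALLOC_RESERVE = 128ULL << 20;      /* vmalloc/ioremap/PCI space */
    const uint64_t LOWMEM_LIMIT    = KERNEL_WINDOW - VMALLOC_RESERVE;

    printf("kernel window : %llu MB\n", (unsigned long long)(KERNEL_WINDOW >> 20));
    printf("PCI/vmalloc   : %llu MB\n", (unsigned long long)(VMALLOC_RESERVE >> 20));
    printf("direct-mapped : %llu MB\n", (unsigned long long)(LOWMEM_LIMIT >> 20));
    /* Physical RAM above LOWMEM_LIMIT is "highmem": it is not permanently
     * visible to the kernel and has to be mapped on demand instead. */
    return 0;
}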