Re: [RFC PATCH 00/13] Introduce first class virtual address spaces
On Thu, 16 Mar 2017, Thomas Gleixner wrote: > On Thu, 16 Mar 2017, Till Smejkal wrote: > > On Thu, 16 Mar 2017, Thomas Gleixner wrote: > > > Why do we need yet another mechanism to represent something which looks > > > like a file instead of simply using existing mechanisms and extend them? > > > > You are right. I also recognized during the discussion with Andy, Chris, > > Matthew, Luck, Rich and the others that there are already other > > techniques in the Linux kernel that can achieve the same functionality > > when combined. As I said also to the others, I will drop the VAS segments > > for future versions. The first class virtual address space feature was > > the more interesting part of the patchset anyways. > > While you are at it, could you please drop this 'first class' marketing as > well? It has zero technical value, really. Yes of course. I am sorry for the trouble that I caused already. Thanks Till ___ linux-snps-arc mailing list linux-snps-arc@lists.infradead.org http://lists.infradead.org/mailman/listinfo/linux-snps-arc
Re: [RFC PATCH 00/13] Introduce first class virtual address spaces
On Thu, 16 Mar 2017, Thomas Gleixner wrote: > Why do we need yet another mechanism to represent something which looks > like a file instead of simply using existing mechanisms and extend them? You are right. I also recognized during the discussion with Andy, Chris, Matthew, Luck, Rich and the others that there are already other techniques in the Linux kernel that can achieve the same functionality when combined. As I said also to the others, I will drop the VAS segments for future versions. The first class virtual address space feature was the more interesting part of the patchset anyways. Thanks Till
Re: [RFC PATCH 00/13] Introduce first class virtual address spaces
On Wed, 15 Mar 2017, Luck, Tony wrote: > On Wed, Mar 15, 2017 at 03:02:34PM -0700, Till Smejkal wrote: > > I don't agree here. VAS segments are basically in-memory files that are > > handled by > > the kernel directly without using a file system. Hence, if an application > > uses a VAS > > segment to store data the same rules apply as if it uses a file. Everything > > that it > > saves in the VAS segment might be accessible by other applications. An > > application > > using VAS segments should be aware of this fact. In addition, the resources > > that are > > represented by a VAS segment are not leaked. As I said, VAS segments are > > much like > > files. Hence, if you don't want to use them any more, delete them. But as > > with files, > > the kernel will not delete them for you (although something like this can > > be added). > > So how do they differ from shmget(2), shmat(2), shmdt(2), shmctl(2)? > > Apart from VAS having better names, instead of silly "key_t key" ones. Unfortunately, I have to admit that the VAS segments don't differ from shm* a lot. The implementation is different, but the functionality that you can achieve with it is very similar. I am sorry. We should have looked more closely at the whole functionality that is provided by the shmem subsystem before working on VAS segments. However, VAS segments are not the key part of this patch set. The more interesting functionality in our opinion is the introduction of first class virtual address spaces and what they can be used for. VAS segments were just another logical step for us (from first class virtual address spaces to first class virtual address space segments) but since their functionality can be achieved with various other already existing features of the Linux kernel, I will probably drop them in future versions of the patchset. Thanks Till
Re: [RFC PATCH 00/13] Introduce first class virtual address spaces
On Wed, 15 Mar 2017, Andy Lutomirski wrote: > > One advantage of VAS segments is that they can be globally queried by user > > programs > > which means that VAS segments can be shared by applications that not > > necessarily have > > to be related. If I am not mistaken, MAP_SHARED of pure in memory data will > > only work > > if the tasks that share the memory region are related (aka. have a common > > parent that > > initialized the shared mapping). Otherwise, the shared mapping have to be > > backed by a > > file. > > What's wrong with memfd_create()? > > > VAS segments on the other side allow sharing of pure in memory data by > > arbitrary related tasks without the need of a file. This becomes especially > > interesting if one combines VAS segments with non-volatile memory since one > > can keep > > data structures in the NVM and still be able to share them between multiple > > tasks. > > What's wrong with regular mmap? I never wanted to say that there is something wrong with regular mmap. We just figured that with VAS segments you could remove the need to mmap your shared data but instead can keep everything purely in memory. Unfortunately, I am not at full speed with memfds. Is my understanding correct that if the last user of such a file descriptor closes it, the corresponding memory is freed? Accordingly, memfd cannot be used to keep data in memory while no program is currently using it, can it? To be able to do this you need again some representation of the data in a file? Yes, you can use a tmpfs to keep the file content in memory as well, or some DAX filesystem to keep the file content in NVM, but this always requires that such filesystems are mounted in the system that the application is currently running on. VAS segments on the other side would provide a functionality to achieve the same without the need of any mounted filesystem. 
However, I agree that this is just a small advantage compared to what can already be achieved with the existing functionality provided by the Linux kernel. I probably need to revisit the whole idea of first class virtual address space segments before continuing with this patchset. Thank you very much for the great feedback. > >> >> Ick. Please don't do this. Can we please keep an mm as just an mm > >> >> and not make it look magically different depending on which process > >> >> maps it? If you need a trampoline (which you do, of course), just > >> >> write a trampoline in regular user code and map it manually. > >> > > >> > Did I understand you correctly that you are proposing that the switching > >> > thread > >> > should make sure by itself that its code, stack, … memory regions are > >> > properly setup > >> > in the new AS before/after switching into it? I think, this would make > >> > using first > >> > class virtual address spaces much more difficult for user applications > >> > to the extend > >> > that I am not even sure if they can be used at all. At the moment, > >> > switching into a > >> > VAS is a very simple operation for an application because the kernel > >> > will just simply > >> > do the right thing. > >> > >> Yes. I think that having the same mm_struct look different from > >> different tasks is problematic. Getting it right in the arch code is > >> going to be nasty. The heuristics of what to share are also tough -- > >> why would text + data + stack or whatever you're doing be adequate? > >> What if you're in a thread? What if two tasks have their stacks in > >> the same place? > > > > The different ASes that a task now can have when it uses first class > > virtual address > > spaces are not realized in the kernel by using only one mm_struct per task > > that just > > looks differently but by using multiple mm_structs - one for each AS that > > the task > > can execute in.
When a task attaches a first class virtual address space to > > itself to > > be able to use another AS, the kernel adds a temporary mm_struct to this > > task that > > contains the mappings of the first class virtual address space and the one > > shared > > with the task's original AS. If a thread now wants to switch into this > > attached first > > class virtual address space the kernel only changes the 'mm' and > > 'active_mm' pointers > > in the task_struct of the thread to the temporary mm_struct and performs the > > corresponding mm_switch operation. The original mm_struct of the thread > > will not be > > changed. > > > > Accordingly, I do not magically make mm_structs look differently depending > > on the > > task that uses it, but create temporary mm_structs that only contain > > mappings to the > > same memory regions. > > This sounds complicated and fragile. What happens if a heuristically > shared region coincides with a region in the "first class address > space" being selected? If such a conflict happens, the task cannot use the first class address space and the corresponding system call will return an error.
Re: [RFC PATCH 00/13] Introduce first class virtual address spaces
On Wed, 15 Mar 2017, Andy Lutomirski wrote: > On Wed, Mar 15, 2017 at 12:44 PM, Till Smejkal > <till.smej...@googlemail.com> wrote: > > On Wed, 15 Mar 2017, Andy Lutomirski wrote: > >> > One advantage of VAS segments is that they can be globally queried by > >> > user programs > >> > which means that VAS segments can be shared by applications that not > >> > necessarily have > >> > to be related. If I am not mistaken, MAP_SHARED of pure in memory data > >> > will only work > >> > if the tasks that share the memory region are related (aka. have a > >> > common parent that > >> > initialized the shared mapping). Otherwise, the shared mapping have to > >> > be backed by a > >> > file. > >> > >> What's wrong with memfd_create()? > >> > >> > VAS segments on the other side allow sharing of pure in memory data by > >> > arbitrary related tasks without the need of a file. This becomes > >> > especially > >> > interesting if one combines VAS segments with non-volatile memory since > >> > one can keep > >> > data structures in the NVM and still be able to share them between > >> > multiple tasks. > >> > >> What's wrong with regular mmap? > > > > I never wanted to say that there is something wrong with regular mmap. We > > just > > figured that with VAS segments you could remove the need to mmap your > > shared data but > > instead can keep everything purely in memory. > > memfd does that. Yes, that's right. Thanks for giving me the pointer to this. I should have researched more carefully before starting to work at VAS segments. > > VAS segments on the other side would provide a functionality to > > achieve the same without the need of any mounted filesystem. However, I > > agree, that > > this is just a small advantage compared to what can already be achieved > > with the > > existing functionality provided by the Linux kernel. > > I see this "small advantage" as "resource leak and security problem". I don't agree here. 
VAS segments are basically in-memory files that are handled by the kernel directly without using a file system. Hence, if an application uses a VAS segment to store data the same rules apply as if it uses a file. Everything that it saves in the VAS segment might be accessible by other applications. An application using VAS segments should be aware of this fact. In addition, the resources that are represented by a VAS segment are not leaked. As I said, VAS segments are much like files. Hence, if you don't want to use them any more, delete them. But as with files, the kernel will not delete them for you (although something like this can be added). > >> This sounds complicated and fragile. What happens if a heuristically > >> shared region coincides with a region in the "first class address > >> space" being selected? > > > > If such a conflict happens, the task cannot use the first class address > > space and the > > corresponding system call will return an error. However, with the current > > available > > virtual address space size that programs can use, such conflicts are > > probably rare. > > A bug that hits 1% of the time is often worse than one that hits 100% > of the time because debugging it is miserable. I don't agree that this is a bug at all. If there is a conflict in the memory layout of the ASes the application simply cannot use this first class virtual address space. Every application that wants to use first class virtual address spaces should check for error return values and handle them. This situation is similar to mapping a file at some special address in memory because the file contains pointer based data structures and the application wants to use them, but the kernel cannot map the file at this particular position in the application's AS because there is already a different conflicting mapping. If an application wants to do such things, it should also handle all the errors that can occur. 
Till
Re: [RFC PATCH 00/13] Introduce first class virtual address spaces
On Tue, 14 Mar 2017, Chris Metcalf wrote: > On 3/14/2017 12:12 PM, Till Smejkal wrote: > > On Mon, 13 Mar 2017, Andy Lutomirski wrote: > > > On Mon, Mar 13, 2017 at 7:07 PM, Till Smejkal > > > <till.smej...@googlemail.com> wrote: > > > > On Mon, 13 Mar 2017, Andy Lutomirski wrote: > > > > > This sounds rather complicated. Getting TLB flushing right seems > > > > > tricky. Why not just map the same thing into multiple mms? > > > > This is exactly what happens at the end. The memory region that is > > > > described by the > > > > VAS segment will be mapped in the ASes that use the segment. > > > So why is this kernel feature better than just doing MAP_SHARED > > > manually in userspace? > > One advantage of VAS segments is that they can be globally queried by user > > programs > > which means that VAS segments can be shared by applications that not > > necessarily have > > to be related. If I am not mistaken, MAP_SHARED of pure in memory data will > > only work > > if the tasks that share the memory region are related (aka. have a common > > parent that > > initialized the shared mapping). Otherwise, the shared mapping have to be > > backed by a > > file. > > True, but why is this bad? The shared mapping will be memory resident > regardless, even if backed by a file (unless swapped out under heavy > memory pressure, but arguably that's a feature anyway). More importantly, > having a file name is a simple and consistent way of identifying such > shared memory segments. > > With a little work, you can also arrange to map such files into memory > at a fixed address in all participating processes, thus making internal > pointers work correctly. I don't want to say that the interface provided by MAP_SHARED is bad. I am only arguing that VAS segments and the interface that they provide have an advantage over the existing ones in my opinion. 
However, Matthew Wilcox also suggested in some earlier mail that VAS segments could be exported to user space via a special purpose filesystem. This would enable users of VAS segments to also just use some special files to set up the shared memory regions. But since the VAS segment itself already knows where it has to be mapped in the virtual address space of the process, establishing the shared memory region would be very easy for the user. > > VAS segments on the other side allow sharing of pure in memory data by > > arbitrary related tasks without the need of a file. This becomes especially > > interesting if one combines VAS segments with non-volatile memory since one > > can keep > > data structures in the NVM and still be able to share them between multiple > > tasks. > > I am not fully up to speed on NV/pmem stuff, but isn't that exactly what > the DAX mode is supposed to allow you to do? If so, isn't sharing a > mapped file on a DAX filesystem on top of pmem equivalent to what > you're proposing? If I read the documentation on DAX filesystems correctly, it is indeed possible to use them to create files that live purely in NVM. I wasn't fully aware of this feature. Thanks for the pointer. However, the main contribution of this patchset is actually the idea of first class virtual address spaces and that they can be used to allow processes to have multiple different views on the system's main memory. For us, VAS segments were another logical step in the same direction (from first class virtual address spaces to first class address space segments). However, if there is already functionality in the Linux kernel to achieve the exact same behavior, there is no real need to add VAS segments. I will continue thinking about them and either find a different situation where the currently available interface is not sufficient/too complicated or drop VAS segments from future versions of the patch set.
Till
Re: [RFC PATCH 00/13] Introduce first class virtual address spaces
On Mon, 13 Mar 2017, Andy Lutomirski wrote: > On Mon, Mar 13, 2017 at 7:07 PM, Till Smejkal > <till.smej...@googlemail.com> wrote: > > On Mon, 13 Mar 2017, Andy Lutomirski wrote: > >> This sounds rather complicated. Getting TLB flushing right seems > >> tricky. Why not just map the same thing into multiple mms? > > > > This is exactly what happens at the end. The memory region that is > > described by the > > VAS segment will be mapped in the ASes that use the segment. > > So why is this kernel feature better than just doing MAP_SHARED > manually in userspace? One advantage of VAS segments is that they can be globally queried by user programs which means that VAS segments can be shared by applications that do not necessarily have to be related. If I am not mistaken, MAP_SHARED of pure in memory data will only work if the tasks that share the memory region are related (aka. have a common parent that initialized the shared mapping). Otherwise, the shared mapping has to be backed by a file. VAS segments on the other side allow sharing of pure in memory data by arbitrary unrelated tasks without the need of a file. This becomes especially interesting if one combines VAS segments with non-volatile memory since one can keep data structures in the NVM and still be able to share them between multiple tasks. > >> Ick. Please don't do this. Can we please keep an mm as just an mm > >> and not make it look magically different depending on which process > >> maps it? If you need a trampoline (which you do, of course), just > >> write a trampoline in regular user code and map it manually. > > > > Did I understand you correctly that you are proposing that the switching > > thread > > should make sure by itself that its code, stack, … memory regions are > > properly setup > > in the new AS before/after switching into it?
I think, this would make > > using first > > class virtual address spaces much more difficult for user applications to > > the extend > > that I am not even sure if they can be used at all. At the moment, > > switching into a > > VAS is a very simple operation for an application because the kernel will > > just simply > > do the right thing. > > Yes. I think that having the same mm_struct look different from > different tasks is problematic. Getting it right in the arch code is > going to be nasty. The heuristics of what to share are also tough -- > why would text + data + stack or whatever you're doing be adequate? > What if you're in a thread? What if two tasks have their stacks in > the same place? The different ASes that a task now can have when it uses first class virtual address spaces are not realized in the kernel by using only one mm_struct per task that just looks differently but by using multiple mm_structs - one for each AS that the task can execute in. When a task attaches a first class virtual address space to itself to be able to use another AS, the kernel adds a temporary mm_struct to this task that contains the mappings of the first class virtual address space and the one shared with the task's original AS. If a thread now wants to switch into this attached first class virtual address space the kernel only changes the 'mm' and 'active_mm' pointers in the task_struct of the thread to the temporary mm_struct and performs the corresponding mm_switch operation. The original mm_struct of the thread will not be changed. Accordingly, I do not magically make mm_structs look differently depending on the task that uses it, but create temporary mm_structs that only contain mappings to the same memory regions. I agree that finding a good heuristics of what to share is difficult. At the moment, all memory regions that are available in the task's original AS will also be available when a thread switches into an attached first class virtual address space (aka. 
are shared). That means that VAS can mainly be used to extend the AS of a task in the current state of the implementation. The reason why I implemented the sharing in this way is that I didn't want to break shared libraries. If I only share code+heap+stack, shared libraries would not work anymore after switching into a VAS. > I could imagine something like a sigaltstack() mode that lets you set > a signal up to also switch mm could be useful. This is a very interesting idea. I will keep it in mind for future use cases of multiple virtual address spaces per task. Thanks Till
Re: [RFC PATCH 07/13] kernel/fork: Split and export 'mm_alloc' and 'mm_init'
On Tue, 14 Mar 2017, David Laight wrote: > From: Linuxppc-dev Till Smejkal > > Sent: 13 March 2017 22:14 > > The only way until now to create a new memory map was via the exported > > function 'mm_alloc'. Unfortunately, this function not only allocates a new > > memory map, but also completely initializes it. However, with the > > introduction of first class virtual address spaces, some initialization > > steps done in 'mm_alloc' are not applicable to the memory maps needed for > > this feature and hence would lead to errors in the kernel code. > > > > Instead of introducing a new function that can allocate and initialize > > memory maps for first class virtual address spaces and potentially > > duplicate some code, I decided to split the mm_alloc function as well as > > the 'mm_init' function that it uses. > > > > Now there are four functions exported instead of only one. The new > > 'mm_alloc' function only allocates a new mm_struct and zeros it out. If one > > want to have the old behavior of mm_alloc one can use the newly introduced > > function 'mm_alloc_and_setup' which not only allocates a new mm_struct but > > also fully initializes it. > ... > > That looks like bugs waiting to happen. > You need unchanged code to fail to compile. Thank you for this hint. I can give the new mm_alloc function a different name so that code that uses the *old* mm_alloc function will fail to compile. I just reused the old name when I wrote the code, because mm_alloc was only used in very few locations in the kernel (2 times in the whole kernel source) which made identifying and changing them very easy. I also don't think that there will be many users in the kernel for mm_alloc in the future because it is a relatively low level data structure. But if it is better to use a different name for the new function, I am very happy to change this. Till
Re: [RFC PATCH 00/13] Introduce first class virtual address spaces
On Mon, 13 Mar 2017, Andy Lutomirski wrote: > On Mon, Mar 13, 2017 at 3:14 PM, Till Smejkal > <till.smej...@googlemail.com> wrote: > > This patchset extends the kernel memory management subsystem with a new > > type of address spaces (called VAS) which can be created and destroyed > > independently of processes by a user in the system. During its lifetime > > such a VAS can be attached to processes by the user which allows a process > > to have multiple address spaces and thereby multiple, potentially > > different, views on the system's main memory. During its execution the > > threads belonging to the process are able to switch freely between the > > different attached VAS and the process' original AS enabling them to > > utilize the different available views on the memory. > > Sounds like the old SKAS feature for UML. I haven't heard of this feature before, but after shortly looking at the description on the UML website it actually has some similarities with what I am proposing. But as far as I can see this was not merged into the mainline kernel, was it? In addition, I think that first class virtual address spaces goes even one step further by allowing AS to live independently of processes. > > In addition to the concept of first class virtual address spaces, this > > patchset introduces yet another feature called VAS segments. VAS segments > > are memory regions which have a fixed size and position in the virtual > > address space and can be shared between multiple first class virtual > > address spaces. Such shareable memory regions are especially useful for > > in-memory pointer-based data structures or other pure in-memory data. > > This sounds rather complicated. Getting TLB flushing right seems > tricky. Why not just map the same thing into multiple mms? This is exactly what happens at the end. The memory region that is described by the VAS segment will be mapped in the ASes that use the segment. 
> >
> >        |  VAS  | processes |
> > -------+-------+-----------
> > switch | 468ns |  1944ns   |
>
> The solution here is IMO to fix the scheduler.

IMHO it will be very difficult for the scheduler code to reach the same switching time as the pure VAS switch because switching between VAS does not involve saving any registers or FPU state and does not require selecting the next runnable task. VAS switch is basically a system call that just changes the AS of the current thread which makes it a very lightweight operation.

> Also, FWIW, I have patches (that need a little work) that will make
> switch_mm() way faster on x86.

These patches will also improve the speed of the VAS switch operation. We are also using the switch_mm function in the background to perform the actual hardware switch between the two ASes. The main reason why the VAS switch is faster than the task switch is that it just has to do fewer things.
I think this would make using first class virtual address spaces much more difficult for user applications, to the extent that I am not even sure if they can be used at all. At the moment, switching into a VAS is a very simple operation for an application because the kernel will simply do the right thing. Till
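[Editor's note: the lifecycle discussed above -- create a VAS, attach it to a process, switch threads in and out of it -- can be summarized in a short pseudocode sketch. This is not runnable code: the wrapper names below (vas_create, vas_attach, vas_switch, vas_detach) are placeholders standing in for whatever system call interface the patchset actually exposes.]

```
/* Pseudocode -- placeholder names, not necessarily the patchset's API. */
int vid  = vas_create("my-vas", 0600);     /* VAS exists independently of any process */
int avid = vas_attach(getpid(), vid, 0);   /* attach it to the calling process        */

vas_switch(avid);   /* this thread now executes in the attached AS view */
/* ... use memory that is only mapped in this view ... */
vas_switch(0);      /* back to the process' original AS */

vas_detach(avid);   /* the VAS itself keeps existing for later users */
```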
Re: [RFC PATCH 10/13] mm: Introduce first class virtual address spaces
Hi Vineet, On Mon, 13 Mar 2017, Vineet Gupta wrote: > I've not looked at the patches closely (or read the referenced paper fully > yet), > but at first glance it seems on ARC architecture, we can potentially > use/leverage this mechanism to implement the shared TLB entries. Before anyone > shouts these are not the same as the IA64/x86 protection keys which allow TLB > entries > with different protection bits across processes etc. These TLB entries are > actually *shared* by processes. > > Conceptually there's shared address spaces, independent of processes. e.g. > ldso > code is shared address space #1, libc (code) #2. System can support a > limited > number of shared addr spaces (say 64, enough for typical embedded sys). > > While Normal TLB entries are tagged with ASID (Addr space ID) to keep them > unique > across processes, Shared TLB entries are tagged with Shared address space ID. > > A process MMU context consists of ASID (a single number) and a SASID bitmap > (to > allow "subscription" to multiple Shared spaces). The subscriptions are set up > by > userspace ld.so which knows about the libs process wants to map. > > The restriction of course is that the spaces are mapped at the *same* vaddr in all > participating processes. I know this goes against whole security, address > space > randomization - but it gives much better real time performance. Why does each > process need to take an MMU exception for libc code... > > So long story short - it seems there can be multiple uses of this > infrastructure ! During the development of this code, we also looked at shared TLB entries, but the other way around. We wanted to use them to prevent flushing of TLB entries of shared memory regions when switching between multiple ASes. Unfortunately, we never finished this part of the code. However, we also investigated a different use case for first class virtual address spaces that is related to what you propose, if I didn't misunderstand something.
The idea is to move shared libraries into their own first class virtual address space and only load some small trampoline code in the application's AS. This trampoline code performs the VAS switch into the library's AS and executes the requested function there. If we combine this architecture with tagged TLB entries to prevent TLB flushes during the switch operation, it can also reach acceptable performance. A side effect of moving the shared library into its own AS is that it cannot be used for ROP attacks because it is not accessible in the application's AS. Till
Re: [RFC PATCH 00/13] Introduce first class virtual address spaces
On Tue, 14 Mar 2017, Richard Henderson wrote: > On 03/14/2017 10:39 AM, Till Smejkal wrote: > > > Is this an indication that full virtual address spaces are useless? It > > > would seem like if you only use virtual address segments then you avoid > > > all > > > of the problems with executing code, active stacks, and brk. > > > > What do you mean with *virtual address segments*? The nice part of first > > class > > virtual address spaces is that one can share/reuse collections of address > > space > > segments easily. > > What do *I* mean? You introduced the term, didn't you? > Rereading your original I see you called them "VAS segments". Oh, I am sorry. I thought that you were referring to some other feature that I don't know. > Anyway, whatever they are called, it would seem that these segments do not > require any of the syncing mechanisms that are causing you problems. Yes, VAS segments provide a possibility to share memory regions between multiple address spaces without the need to synchronize heap, stack, etc. Unfortunately, the VAS segment feature itself without the whole concept of first class virtual address spaces is not as powerful. With some additional work it can probably be represented with the existing shmem functionality. The first class virtual address space feature on the other side provides a real benefit for applications in our opinion, namely that an application can switch between different views on its memory, which enables various interesting programming paradigms as mentioned in the cover letter. Till
[RFC PATCH 13/13] fs/proc: Add procfs support for first class virtual address spaces
Add new files and directories to the procfs file system that contain various information about the first class virtual address spaces attached to the processes in the system. To the procfs directory of each process in the system (/proc/$PID) an additional directory with the name 'vas' is added that contains information about all the VAS that are attached to this process. In this directory one can find, for each attached VAS, a directory containing a file with some status information about the attached VAS, a file with the current memory map of the attached VAS, and a link to the sysfs directory of the underlying VAS. Signed-off-by: Till Smejkal <till.smej...@gmail.com> --- fs/proc/base.c | 528 + fs/proc/inode.c| 1 + fs/proc/internal.h | 1 + mm/Kconfig | 9 + 4 files changed, 539 insertions(+) diff --git a/fs/proc/base.c b/fs/proc/base.c index 87c9a9aacda3..e60c13dd087c 100644 --- a/fs/proc/base.c +++ b/fs/proc/base.c @@ -45,6 +45,9 @@ * * Paul Mundt <paul.mu...@nokia.com>: * Overall revision about smaps. + * + * Till Smejkal <till.smej...@gmail.com>: + * Add entries for first class virtual address spaces. */ #include @@ -87,6 +90,7 @@ #include #include #include +#include #ifdef CONFIG_HARDWALL #include #endif @@ -2841,6 +2845,527 @@ static int proc_pid_personality(struct seq_file *m, struct pid_namespace *ns, return err; } +#ifdef CONFIG_VAS_PROCFS + +/** + * Get a string representation of the access type to a VAS. + **/ +#define vas_access_type_str(type) ((type) & MAY_WRITE ? \ + ((type) & MAY_READ ? 
"rw" : "wo") : "ro") + +static int att_vas_show_status(struct seq_file *sf, void *unused) +{ + struct inode *inode = sf->private; + struct proc_inode *pi = PROC_I(inode); + struct task_struct *tsk; + struct vas_context *vas_ctx; + struct att_vas *avas; + int vid = pi->vas_id; + + tsk = get_proc_task(inode); + if (!tsk) + return -ENOENT; + + vas_ctx = tsk->vas_ctx; + + vas_context_lock(vas_ctx); + + list_for_each_entry(avas, _ctx->vases, tsk_link) { + if (vid == avas->vas->id) + goto good_att_vas; + } + + vas_context_unlock(vas_ctx); + put_task_struct(tsk); + + return -ENOENT; + +good_att_vas: + seq_printf(sf, + "pid: %d\n" + "vid: %d\n" + "type: %s\n", + avas->tsk->pid, avas->vas->id, + vas_access_type_str(avas->type)); + + vas_context_unlock(vas_ctx); + put_task_struct(tsk); + + return 0; +} + +static int att_vas_show_status_open(struct inode *inode, struct file *file) +{ + return single_open(file, att_vas_show_status, inode); +} + +static const struct file_operations att_vas_show_status_fops = { + .open = att_vas_show_status_open, + .read = seq_read, + .llseek = seq_lseek, + .release= single_release, +}; + +static int att_vas_show_mappings(struct seq_file *sf, void *unused) +{ + struct inode *inode = sf->private; + struct proc_inode *pi = PROC_I(inode); + struct task_struct *tsk; + struct vas_context *vas_ctx; + struct att_vas *avas; + struct mm_struct *mm; + struct vm_area_struct *vma; + int vid = pi->vas_id; + + tsk = get_proc_task(inode); + if (!tsk) + return -ENOENT; + + vas_ctx = tsk->vas_ctx; + + vas_context_lock(vas_ctx); + + list_for_each_entry(avas, _ctx->vases, tsk_link) { + if (avas->vas->id == vid) + goto good_att_vas; + } + + vas_context_unlock(vas_ctx); + put_task_struct(tsk); + + return -ENOENT; + +good_att_vas: + mm = avas->mm; + + down_read(>mmap_sem); + + if (!mm->mmap) { + seq_puts(sf, "EMPTY\n"); + goto out_unlock; + } + + for (vma = mm->mmap; vma; vma = vma->vm_next) { + vm_flags_t flags = vma->vm_flags; + struct file *file = vma->vm_file; + 
unsigned long long pgoff = 0; + + if (file) + pgoff = ((loff_t)vma->vm_pgoff) << PAGE_SHIFT; + + seq_printf(sf, "%08lx-%08lx %c%c%c%c [%c:%c] %08llx", + vma->vm_start, vma->vm_end, + flags & VM_READ ? 'r' : '-', + flags & VM_WRITE ? 'w' : '-', + flags & VM_EXEC ? 'x' : '-', + flags & VM_MAYSHARE ? 's' : 'p', +
[RFC PATCH 10/13] mm: Introduce first class virtual address spaces
Introduce a different type of address spaces which are first class citizens in the OS. That means that the kernel now handles two types of AS, those which are closely coupled with a process and those which aren't. While the former ones are created and destroyed together with the process by the kernel and are the default type of AS in the Linux kernel, the latter ones have to be managed explicitly by the user and are the newly introduced type. Accordingly, a first class AS (also called VAS == virtual address space) can exist in the OS independently of any process. A user has to explicitly create and destroy them in the system. Processes and VAS can be combined by attaching a previously created VAS to a process, which basically adds an additional AS to the process that the process' threads are able to execute in. Hence, VAS allow a process to have different views of the main memory of the system (its original AS and the attached VAS) between which its threads can switch arbitrarily during their lifetime. The functionality made available through first class virtual address spaces can be used in various different ways. One possible way to utilize VAS is to compartmentalize a process for security reasons. Another possible usage is to improve the performance of data-centric applications by being able to manage different sets of data in memory without the need to map or unmap them. Furthermore, first class virtual address spaces can be attached to different processes at the same time if the underlying memory is only readable. This mechanism allows sharing of whole address spaces between multiple processes that can all execute in them using the contained memory. 
Signed-off-by: Till Smejkal <till.smej...@gmail.com> Signed-off-by: Marco Benatto <marco.antonio@gmail.com> --- MAINTAINERS| 10 + arch/x86/entry/syscalls/syscall_32.tbl |9 + arch/x86/entry/syscalls/syscall_64.tbl |9 + fs/exec.c |3 + include/linux/mm_types.h |8 + include/linux/sched.h | 17 + include/linux/syscalls.h | 11 + include/linux/vas.h| 182 +++ include/linux/vas_types.h | 88 ++ include/uapi/asm-generic/unistd.h | 20 +- include/uapi/linux/Kbuild |1 + include/uapi/linux/vas.h | 16 + init/main.c|2 + kernel/exit.c |2 + kernel/fork.c | 28 +- kernel/sys_ni.c| 11 + mm/Kconfig | 20 + mm/Makefile|1 + mm/internal.h |8 + mm/memory.c|3 + mm/mmap.c | 22 + mm/vas.c | 2188 22 files changed, 2657 insertions(+), 2 deletions(-) create mode 100644 include/linux/vas.h create mode 100644 include/linux/vas_types.h create mode 100644 include/uapi/linux/vas.h create mode 100644 mm/vas.c diff --git a/MAINTAINERS b/MAINTAINERS index 527d13759ecc..060b1c64e67a 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -5040,6 +5040,16 @@ F: Documentation/firmware_class/ F: drivers/base/firmware*.c F: include/linux/firmware.h +FIRST CLASS VIRTUAL ADDRESS SPACES +M: Till Smejkal <till.smej...@gmail.com> +L: linux-ker...@vger.kernel.org +L: linux...@kvack.org +S: Maintained +F: include/linux/vas_types.h +F: include/linux/vas.h +F: include/uapi/linux/vas.h +F: mm/vas.c + FLASH ADAPTER DRIVER (IBM Flash Adapter 900GB Full Height PCI Flash Card) M: Joshua Morris <josh.h.mor...@us.ibm.com> M: Philip Kelleher <pjk1...@linux.vnet.ibm.com> diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl index 2b3618542544..8c553eef8c44 100644 --- a/arch/x86/entry/syscalls/syscall_32.tbl +++ b/arch/x86/entry/syscalls/syscall_32.tbl @@ -389,3 +389,12 @@ 380i386pkey_mprotect sys_pkey_mprotect 381i386pkey_alloc sys_pkey_alloc 382i386pkey_free sys_pkey_free +383i386vas_create sys_vas_create +384i386vas_delete sys_vas_delete +385i386vas_findsys_vas_find +386i386vas_attach sys_vas_attach 
+387i386vas_detach sys_vas_detach +388i386vas_switch sys_vas_switch +389i386active_vas sys_active_vas +390i386vas_getattr sys_vas_getattr +391i386vas_setattr sys_vas_setattr diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl index e93ef0b38db8..72f1f0495710 100644 --- a/arch/x86/entry/syscalls/syscall_64.tbl +++ b/arch/x86/entry/syscalls/syscall_64
[RFC PATCH 12/13] mm/vas: Add lazy-attach support for first class virtual address spaces
Until now, whenever a task attaches a first class virtual address space, all the memory regions currently present in the task are replicated into the first class virtual address space so that the task can continue executing as if nothing has changed. However, this technique makes the attach and detach operations very costly, since the whole memory map of the task has to be duplicated. Lazy-attaching, on the other hand, uses a technique similar to the one used to copy page tables during fork. Instead of completely duplicating the memory map of the task together with its page tables, only a skeleton memory map is created and later filled with content by page faults when the process actually accesses the memory regions. The big advantage is that unnecessary memory regions are not duplicated at all, but only those that the process actually uses while executing inside the first class virtual address space. The only memory region that is always duplicated during the attach operation is the code section, because this memory region is always necessary for execution and duplicating it eagerly saves one page fault later during the process execution. 
Signed-off-by: Till Smejkal <till.smej...@gmail.com> --- include/linux/mm_types.h | 1 + include/linux/vas.h | 26 mm/Kconfig | 18 ++ mm/memory.c | 5 ++ mm/vas.c | 164 ++- 5 files changed, 197 insertions(+), 17 deletions(-) diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h index 82bf78ea83ee..65e04f14225d 100644 --- a/include/linux/mm_types.h +++ b/include/linux/mm_types.h @@ -362,6 +362,7 @@ struct vm_area_struct { #ifdef CONFIG_VAS struct mm_struct *vas_reference; ktime_t vas_last_update; + bool vas_attached; #endif }; diff --git a/include/linux/vas.h b/include/linux/vas.h index 376b9fa1ee27..8682bfc86568 100644 --- a/include/linux/vas.h +++ b/include/linux/vas.h @@ -2,6 +2,7 @@ #define _LINUX_VAS_H +#include #include #include @@ -293,4 +294,29 @@ static inline int vas_exit(struct task_struct *tsk) { return 0; } #endif /* CONFIG_VAS */ + +/*** + * Management of the VAS lazy attaching + ***/ + +#ifdef CONFIG_VAS_LAZY_ATTACH + +/** + * Lazily update the page tables of a vm_area which was not completely set up + * during the VAS attaching. + * + * @param[in] vma: The vm_area for which the page tables should be + * setup before continuing the page fault handling. + * + * @returns: 0 if the lazy-attach was successful or not + * necessary, or 1 if something went wrong. + */ +extern int vas_lazy_attach_vma(struct vm_area_struct *vma); + +#else /* CONFIG_VAS_LAZY_ATTACH */ + +static inline int vas_lazy_attach_vma(struct vm_area_struct *vma) { return 0; } + +#endif /* CONFIG_VAS_LAZY_ATTACH */ + #endif diff --git a/mm/Kconfig b/mm/Kconfig index 9a80877f3536..934c56bcdbf4 100644 --- a/mm/Kconfig +++ b/mm/Kconfig @@ -720,6 +720,24 @@ config VAS If not sure, then say N. +config VAS_LAZY_ATTACH + bool "Use lazy-attach for First Class Virtual Address Spaces" + depends on VAS + default y + help + When this option is enabled, memory regions of First Class Virtual + Address Spaces will be mapped in the task's address space lazily after + the switch happened. 
That means, the actual mapping will happen when a + page fault occurs for the particular memory region. While this + technique is less costly during the switching operation, it can become + very costly during the page fault handling. + + Hence if the program uses a lot of different memory regions, this + lazy-attaching technique can be more costly than doing the mapping + eagerly during the switch. + + If not sure, then say Y. + config VAS_DEBUG bool "Debugging output for First Class Virtual Address Spaces" depends on VAS diff --git a/mm/memory.c b/mm/memory.c index e4747b3fd5b9..cdefc99a50ac 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -64,6 +64,7 @@ #include #include #include +#include #include #include @@ -4000,6 +4001,10 @@ int handle_mm_fault(struct vm_area_struct *vma, unsigned long address, /* do counter updates before entering really critical section. */ check_sync_rss_stat(current); + /* Check if this VMA belongs to a VAS and needs to be lazy attached. */ + if (unlikely(vas_lazy_attach_vma(vma))) + return VM_FAULT_SIGSEGV; + /* * Enable the memcg OOM handling for faults triggered in user * space. Kernel faults are handled more gracefully. diff --git a/mm/vas.c b/mm/vas.c index 345b023c21aa..953ba8d6e603 100644 --- a/mm/vas.c +++ b/mm/vas.c @@ -138,12 +138,13 @@ static void __dump_memory_map(const char *ti
Re: [RFC PATCH 00/13] Introduce first class virtual address spaces
On Tue, 14 Mar 2017, Richard Henderson wrote: > On 03/14/2017 08:14 AM, Till Smejkal wrote: > > At the current state of the development, first class virtual address spaces > > have one limitation, that we haven't been able to solve so far. The feature > > allows, that different threads of the same process can execute in different > > AS at the same time. This is possible, because the VAS-switch operation > > only changes the active mm_struct for the task_struct of the calling > > thread. However, when a thread switches into a first class virtual address > > space, some parts of its original AS are duplicated into the new one to > > allow the thread to continue its execution at its current state. > > Accordingly, parts of the processes AS (e.g. the code section, data > > section, heap section and stack sections) exist in multiple AS if the > > process has a VAS attached to it. Changes to these shared memory regions > > are synchronized between the address spaces whenever a thread switches > > between two of them. Unfortunately, in some scenarios the kernel is not > > able to properly synchronize all these shared memory regions because of > > conflicting changes. One such example happens if there are two threads, one > > executing in an attached first class virtual address space, the other in > > the tasks original address space. If both threads make changes to the heap > > section that cause expansion of the underlying vm_area_struct, the kernel > > cannot correctly synchronize these changes, because that would cause parts > > of the virtual address space to be overwritten with unrelated data. In the > > current implementation such conflicts are only detected but not resolved > > and result in an error code being returned by the kernel during the VAS > > switch operation. Unfortunately, that means for the particular thread that > > tried to make the switch, that it cannot do this anymore in the future and > > accordingly has to be killed. 
> > This sounds like a fairly fundamental problem to me.

Yes, I agree. This is a significant limitation of first class virtual address spaces. However, conflicts like this can be mitigated by being careful in the application that uses multiple first class virtual address spaces. If all threads make sure that they never resize shared memory regions while executing inside a VAS, such conflicts do not occur.

Another possibility that I investigated but have not yet finished is to synchronize such resizes of shared memory regions more frequently than just at every switch between VASes. If one, for example, "forwards" memory region resizes to all AS that share this particular memory region during the resize operation, one can completely eliminate this problem. Unfortunately, this introduces a significant cost and a difficult-to-handle race condition.

> Is this an indication that full virtual address spaces are useless? It
> would seem like if you only use virtual address segments then you avoid all
> of the problems with executing code, active stacks, and brk.

What do you mean with *virtual address segments*? The nice part of first class virtual address spaces is that one can share/reuse collections of address space segments easily.

Till
[RFC PATCH 08/13] kernel/fork: Define explicitly which mm_struct to duplicate during fork
The dup_mm() function used during 'do_fork' to duplicate the current task's mm_struct for the newly forked task always implicitly uses current->mm for this purpose. However, during copy_mm it was already decided which mm_struct to copy/duplicate. So pass this mm_struct to dup_mm instead of deciding again which mm_struct to use. Signed-off-by: Till Smejkal <till.smej...@gmail.com> --- kernel/fork.c | 7 --- 1 file changed, 4 insertions(+), 3 deletions(-) diff --git a/kernel/fork.c b/kernel/fork.c index 9209f6d5d7c0..d3087d870855 100644 --- a/kernel/fork.c +++ b/kernel/fork.c @@ -1158,9 +1158,10 @@ void mm_release(struct task_struct *tsk, struct mm_struct *mm) * Allocate a new mm structure and copy contents from the * mm structure of the passed in task structure. */ -static struct mm_struct *dup_mm(struct task_struct *tsk) +static struct mm_struct *dup_mm(struct task_struct *tsk, + struct mm_struct *oldmm) { - struct mm_struct *mm, *oldmm = current->mm; + struct mm_struct *mm; int err; mm = allocate_mm(); @@ -1226,7 +1227,7 @@ static int copy_mm(unsigned long clone_flags, struct task_struct *tsk) } retval = -ENOMEM; - mm = dup_mm(tsk); + mm = dup_mm(tsk, oldmm); if (!mm) goto fail_nomem; -- 2.12.0
Re: [RFC PATCH 11/13] mm/vas: Introduce VAS segments - shareable address space regions
Hi Matthew,

On Mon, 13 Mar 2017, Matthew Wilcox wrote:
> On Mon, Mar 13, 2017 at 03:14:13PM -0700, Till Smejkal wrote:
> > +/**
> > + * Create a new VAS segment.
> > + *
> > + * @param[in] name: The name of the new VAS segment.
> > + * @param[in] start: The address where the VAS segment begins.
> > + * @param[in] end: The address where the VAS segment ends.
> > + * @param[in] mode: The access rights for the VAS segment.
> > + *
> > + * @returns: The VAS segment ID on success, -ERRNO otherwise.
> > + **/
>
> Please follow the kernel-doc conventions, as described in
> Documentation/doc-guide/kernel-doc.rst. Also, function documentation
> goes with the implementation, not the declaration.

Thank you for this pointer. I wasn't aware of this convention. I will change the patches accordingly.

> > +/**
> > + * Get ID of the VAS segment belonging to a given name.
> > + *
> > + * @param[in] name: The name of the VAS segment for which the ID
> > + * should be returned.
> > + *
> > + * @returns: The VAS segment ID on success, -ERRNO otherwise.
> > + **/
> > +extern int vas_seg_find(const char *name);
>
> So ... segments have names, and IDs ... and access permissions ...
> Why isn't this a special purpose filesystem?

We also thought about this. However, we decided against implementing them as a special purpose filesystem, mainly because we could not think of a good way to represent a VAS/VAS segment in such a filesystem (should they be represented as files or as directories?) and we weren't sure what a hierarchy in the filesystem would mean for the underlying address spaces. Hence, we decided against it and rather used a combination of IDR and sysfs. However, I don't have any strong feelings and would also reimplement them as a special purpose filesystem if people would rather have them be one.

Till
[RFC PATCH 06/13] mm/mmap: Export 'vma_link' and 'find_vma_links' to mm subsystem
Make the functions 'vma_link' and 'find_vma_links' accessible to other source files in the mm/ source directory of the kernel so that other files in that directory can also perform low level changes to mm_struct data structures. Signed-off-by: Till Smejkal <till.smej...@gmail.com> --- mm/internal.h | 11 +++ mm/mmap.c | 12 ++-- 2 files changed, 17 insertions(+), 6 deletions(-) diff --git a/mm/internal.h b/mm/internal.h index 7aa2ea0a8623..e22cb031b45b 100644 --- a/mm/internal.h +++ b/mm/internal.h @@ -76,6 +76,17 @@ static inline void set_page_refcounted(struct page *page) extern unsigned long highest_memmap_pfn; /* + * in mm/mmap.c + */ +extern void vma_link(struct mm_struct *mm, struct vm_area_struct *vma, +struct vm_area_struct *prev, struct rb_node **rb_link, +struct rb_node *rb_parent); +extern int find_vma_links(struct mm_struct *mm, unsigned long addr, + unsigned long end, struct vm_area_struct **pprev, + struct rb_node ***rb_link, + struct rb_node **rb_parent); + +/* * in mm/vmscan.c: */ extern int isolate_lru_page(struct page *page); diff --git a/mm/mmap.c b/mm/mmap.c index 3f60c8ebd6b6..d35c6b51cadf 100644 --- a/mm/mmap.c +++ b/mm/mmap.c @@ -466,9 +466,9 @@ anon_vma_interval_tree_post_update_vma(struct vm_area_struct *vma) anon_vma_interval_tree_insert(avc, &avc->anon_vma->rb_root); } -static int find_vma_links(struct mm_struct *mm, unsigned long addr, - unsigned long end, struct vm_area_struct **pprev, - struct rb_node ***rb_link, struct rb_node **rb_parent) +int find_vma_links(struct mm_struct *mm, unsigned long addr, + unsigned long end, struct vm_area_struct **pprev, + struct rb_node ***rb_link, struct rb_node **rb_parent) { struct rb_node **__rb_link, *__rb_parent, *rb_prev; @@ -580,9 +580,9 @@ __vma_link(struct mm_struct *mm, struct vm_area_struct *vma, __vma_link_rb(mm, vma, rb_link, rb_parent); } -static void vma_link(struct mm_struct *mm, struct vm_area_struct *vma, - struct vm_area_struct *prev, struct rb_node **rb_link, - struct rb_node *rb_parent) 
+void vma_link(struct mm_struct *mm, struct vm_area_struct *vma, + struct vm_area_struct *prev, struct rb_node **rb_link, + struct rb_node *rb_parent) { struct address_space *mapping = NULL; -- 2.12.0
Re: [RFC PATCH 10/13] mm: Introduce first class virtual address spaces
Hi Greg,

First of all, thanks for your reply.

On Tue, 14 Mar 2017, Greg Kroah-Hartman wrote:
> On Mon, Mar 13, 2017 at 03:14:12PM -0700, Till Smejkal wrote:
>
> There's no way with that many cc: lists and people that this is really
> making it through very many people's filters and actually on a mailing
> list. Please trim them down.

I am sorry that the patch's cc-list is too big. This was the list of people that the get_maintainers.pl script produced. I already recognized that it was a huge number of people, but I didn't want to remove anyone from the list because I wasn't sure who would be interested in this patch set. Do you have any suggestion whom to remove from the list? I don't want to annoy anyone with useless emails.

> Minor sysfs questions/issues:
>
> > +struct vas {
> > + struct kobject kobj; /* < the internal kobject that we use
> > + * for reference counting and sysfs
> > + * handling. */
> > +
> > + int id; /* < ID */
> > + char name[VAS_MAX_NAME_LENGTH]; /* < name */
>
> The kobject has a name, why not use that?

The reason why I don't use the kobject's name is that I don't restrict the names that are used for VAS/VAS segments. Accordingly, it would be allowed to use a name like "foo/bar/xyz" as VAS name. However, I am not sure what would happen in the sysfs if I would use such a name for the kobject. Especially since one could think of another VAS with the name "foo/bar" whose name would conflict with the first one, although it does not necessarily have any connection with it.

> > +
> > + struct mutex mtx; /* < lock for parallel access. */
> > +
> > + struct mm_struct *mm; /* < a partial memory map containing
> > + * all mappings of this VAS. */
> > +
> > + struct list_head link; /* < the link in the global VAS list. */
> > + struct rcu_head rcu; /* < the RCU helper used for
> > + * asynchronous VAS deletion. */
> > +
> > + u16 refcount; /* < how often is the VAS attached. */
>
> The kobject has a refcount, use that? Don't have 2 refcounts in the
> same structure, that way lies madness. And bugs, lots of bugs...
>
> And if this really is a refcount (hint, I don't think it is), you should
> use the refcount_t type.

I actually use both the internal kobject refcount to keep track of how often a VAS/VAS segment is referenced, and this 'refcount' variable to keep track of how often the VAS is actually attached to a task. They are not necessarily related to each other. I can rename this variable to attach_count. Or, if preferred, I can only use the kobject reference counter and remove this variable completely, though I would lose information about how often the VAS is attached to a task, because the kobject reference counter is also used to keep track of other variables referencing the VAS.

> > +/**
> > + * The sysfs structure we need to handle attributes of a VAS.
> > + **/
> > +struct vas_sysfs_attr {
> > + struct attribute attr;
> > + ssize_t (*show)(struct vas *vas, struct vas_sysfs_attr *vsattr,
> > + char *buf);
> > + ssize_t (*store)(struct vas *vas, struct vas_sysfs_attr *vsattr,
> > + const char *buf, size_t count);
> > +};
> > +
> > +#define VAS_SYSFS_ATTR(NAME, MODE, SHOW, STORE) \
> > +static struct vas_sysfs_attr vas_sysfs_attr_##NAME = \
> > + __ATTR(NAME, MODE, SHOW, STORE)
>
> __ATTR_RO and __ATTR_RW should work better for you. If you really need
> this.

Thank you. I will have a look at these macros.

> Oh, and where is the Documentation/ABI/ updates to try to describe the
> sysfs structure and files? Did I miss that in the series?

Oh sorry, I forgot to add this file. I will add the ABI descriptions for future submissions.

> > +static ssize_t __show_vas_name(struct vas *vas, struct vas_sysfs_attr *vsattr,
> > + char *buf)
> > +{
> > + return scnprintf(buf, PAGE_SIZE, "%s", vas->name);
>
> It's a page size, just use sprintf() and be done with it. No need to
> ever check, you "know" it will be correct.

OK. 
I was following the sysfs example in the documentation that used scnprintf, but if sprintf is preferred, I can change this. > Also, what about a trailing '\n' f
[RFC PATCH 09/13] mm/memory: Add function to one-to-one duplicate page ranges
Add new function to one-to-one duplicate a page table range of one memory map to another memory map. The new function 'dup_page_range' copies the page table entries for the specified region from the page table of the source memory map to the page table of the destination memory map and thereby allows actual sharing of the referenced memory pages instead of relying on copy-on-write for anonymous memory pages or page faults for read-only memory pages as it is done by the existing function 'copy_page_range'. Hence, 'dup_page_range' will produce shared pages between two address spaces whereas 'copy_page_range' will result in copies of pages if necessary. Preexisting mappings in the page table of the destination memory map are properly zapped by the 'dup_page_range' function if they differ from the ones in the source memory map before they are replaced with the new ones. Signed-off-by: Till Smejkal <till.smej...@gmail.com> --- include/linux/huge_mm.h | 6 + include/linux/hugetlb.h | 5 + include/linux/mm.h | 2 + mm/huge_memory.c| 65 +++ mm/hugetlb.c| 205 +++-- mm/memory.c | 461 +--- 6 files changed, 620 insertions(+), 124 deletions(-) diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h index 94a0e680b7d7..52c0498426ef 100644 --- a/include/linux/huge_mm.h +++ b/include/linux/huge_mm.h @@ -5,6 +5,12 @@ extern int do_huge_pmd_anonymous_page(struct vm_fault *vmf); extern int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm, pmd_t *dst_pmd, pmd_t *src_pmd, unsigned long addr, struct vm_area_struct *vma); +extern int dup_huge_pmd(struct mm_struct *dst_mm, + struct vm_area_struct *dst_vma, + struct mm_struct *src_mm, + struct vm_area_struct *src_vma, + struct mmu_gather *tlb, pmd_t *dst_pmd, pmd_t *src_pmd, + unsigned long addr); extern void huge_pmd_set_accessed(struct vm_fault *vmf, pmd_t orig_pmd); extern int do_huge_pmd_wp_page(struct vm_fault *vmf, pmd_t orig_pmd); extern struct page *follow_trans_huge_pmd(struct vm_area_struct *vma, diff --git 
a/include/linux/hugetlb.h b/include/linux/hugetlb.h index 72260cc252f2..d8eb682e39a1 100644 --- a/include/linux/hugetlb.h +++ b/include/linux/hugetlb.h @@ -63,6 +63,10 @@ int hugetlb_mempolicy_sysctl_handler(struct ctl_table *, int, #endif int copy_hugetlb_page_range(struct mm_struct *, struct mm_struct *, struct vm_area_struct *); +int dup_hugetlb_page_range(struct mm_struct *dst_mm, + struct vm_area_struct *dst_vma, + struct mm_struct *src_mm, + struct vm_area_struct *src_vma); long follow_hugetlb_page(struct mm_struct *, struct vm_area_struct *, struct page **, struct vm_area_struct **, unsigned long *, unsigned long *, long, unsigned int); @@ -134,6 +138,7 @@ static inline unsigned long hugetlb_total_pages(void) #define follow_hugetlb_page(m,v,p,vs,a,b,i,w) ({ BUG(); 0; }) #define follow_huge_addr(mm, addr, write) ERR_PTR(-EINVAL) #define copy_hugetlb_page_range(src, dst, vma) ({ BUG(); 0; }) +#define dup_hugetlb_page_range(dst, dst_vma, src, src_vma) ({ BUG(); 0; }) static inline void hugetlb_report_meminfo(struct seq_file *m) { } diff --git a/include/linux/mm.h b/include/linux/mm.h index 92925d97da20..b39ec795f64c 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -1208,6 +1208,8 @@ void free_pgd_range(struct mmu_gather *tlb, unsigned long addr, unsigned long end, unsigned long floor, unsigned long ceiling); int copy_page_range(struct mm_struct *dst, struct mm_struct *src, struct vm_area_struct *vma); +int dup_page_range(struct mm_struct *dst, struct vm_area_struct *dst_vma, + struct mm_struct *src, struct vm_area_struct *src_vma); void unmap_mapping_range(struct address_space *mapping, loff_t const holebegin, loff_t const holelen, int even_cows); int follow_pte_pmd(struct mm_struct *mm, unsigned long address, diff --git a/mm/huge_memory.c b/mm/huge_memory.c index d5b2604867e5..1edf8c6d1814 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -887,6 +887,71 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm, return ret; } +int 
dup_huge_pmd(struct mm_struct *dst_mm, struct vm_area_struct *dst_vma, +struct mm_struct *src_mm, struct vm_area_struct *src_vma, +struct mmu_gather *tlb, pmd_t *dst_pmd, pmd_t *src_pmd, +unsigned long addr) +{ + spinlock_t *dst_ptl, *src_ptl; + struct page *page; + pmd_t pmd; + pgtable_t pgtable; + int ret; + + pgtable = pte_alloc_one(dst_mm, addr); + if (!pgtable) +
[RFC PATCH 04/13] mm: Add mm_struct argument to 'get_unmapped_area' and 'vm_unmapped_area'
Add the mm_struct for which an unmapped area should be found as an explicit argument to the 'get_unmapped_area' function. Previously, the function simply searched for an unmapped area in the memory map of the current task. However, with the introduction of first class virtual address spaces, it is necessary that get_unmapped_area can also look for unmapped areas in memory maps other than the one of the current task. Changing the signature of the generic 'get_unmapped_area' function also requires that all the 'arch_get_unmapped_area' functions, as well as the 'vm_unmapped_area' function and its dependents, take the memory map that they should work on as an additional argument. Simply using the one of the current task, as these functions did before, is not correct anymore and leads to incorrect results. Signed-off-by: Till Smejkal <till.smej...@gmail.com> --- arch/alpha/kernel/osf_sys.c | 19 ++-- arch/arc/mm/mmap.c | 8 ++--- arch/arm/kernel/process.c| 2 +- arch/arm/mm/mmap.c | 19 ++-- arch/arm64/kernel/vdso.c | 2 +- arch/blackfin/include/asm/pgtable.h | 3 +- arch/blackfin/kernel/sys_bfin.c | 5 ++-- arch/frv/mm/elf-fdpic.c | 11 +++ arch/hexagon/kernel/vdso.c | 2 +- arch/ia64/kernel/perfmon.c | 3 +- arch/ia64/kernel/sys_ia64.c | 6 ++-- arch/ia64/mm/hugetlbpage.c | 7 +++-- arch/metag/mm/hugetlbpage.c | 11 +++ arch/mips/kernel/vdso.c | 2 +- arch/mips/mm/mmap.c | 27 + arch/parisc/kernel/sys_parisc.c | 19 ++-- arch/parisc/mm/hugetlbpage.c | 7 +++-- arch/powerpc/include/asm/book3s/64/hugetlb.h | 6 ++-- arch/powerpc/include/asm/page_64.h | 3 +- arch/powerpc/kernel/vdso.c | 2 +- arch/powerpc/mm/hugetlbpage-radix.c | 9 +++--- arch/powerpc/mm/hugetlbpage.c| 9 +++--- arch/powerpc/mm/mmap.c | 17 +-- arch/powerpc/mm/slice.c | 25 arch/s390/kernel/vdso.c | 3 +- arch/s390/mm/mmap.c | 42 +- arch/sh/kernel/vsyscall/vsyscall.c | 2 +- arch/sh/mm/mmap.c| 19 ++-- arch/sparc/include/asm/pgtable_64.h | 4 +-- arch/sparc/kernel/sys_sparc_32.c | 6 ++-- arch/sparc/kernel/sys_sparc_64.c | 31 
+++- arch/sparc/mm/hugetlbpage.c | 26 arch/tile/kernel/vdso.c | 2 +- arch/tile/mm/hugetlbpage.c | 26 arch/x86/entry/vdso/vma.c| 2 +- arch/x86/kernel/sys_x86_64.c | 19 ++-- arch/x86/mm/hugetlbpage.c| 26 arch/xtensa/kernel/syscall.c | 7 +++-- drivers/char/mem.c | 15 ++ drivers/dax/dax.c| 10 +++ drivers/media/usb/uvc/uvc_v4l2.c | 6 ++-- drivers/media/v4l2-core/v4l2-dev.c | 8 ++--- drivers/media/v4l2-core/videobuf2-v4l2.c | 5 ++-- drivers/mtd/mtdchar.c| 3 +- drivers/usb/gadget/function/uvc_v4l2.c | 3 +- fs/hugetlbfs/inode.c | 8 ++--- fs/proc/inode.c | 10 +++ fs/ramfs/file-mmu.c | 5 ++-- fs/ramfs/file-nommu.c| 10 --- fs/romfs/mmap-nommu.c| 3 +- include/linux/fs.h | 2 +- include/linux/huge_mm.h | 6 ++-- include/linux/hugetlb.h | 5 ++-- include/linux/mm.h | 16 ++ include/linux/mm_types.h | 7 +++-- include/linux/sched.h| 10 +++ include/linux/shmem_fs.h | 5 ++-- include/media/v4l2-dev.h | 3 +- include/media/videobuf2-v4l2.h | 5 ++-- ipc/shm.c| 10 +++ kernel/events/uprobes.c | 2 +- mm/huge_memory.c | 18 +++- mm/mmap.c| 44 ++-- mm/mremap.c | 11 +++ mm/nommu.c | 10 --- mm/shmem.c | 14 - sound/core/pcm_native.c | 3 +- 67 files changed, 370 insertions(+), 326 deletions(-) diff --git a/arch/alpha/kernel/osf_sys.c b/arch/alpha/kernel/osf_sys.c index 54d8616644e2..281109bcdc5d
[RFC PATCH 11/13] mm/vas: Introduce VAS segments - shareable address space regions
VAS segments are an extension to first class virtual address spaces that can be used to share specific memory regions between multiple first class virtual address spaces. VAS segments have a specific size and position in a virtual address space and can thereby be used to share in-memory pointer-based data structures between multiple address spaces, as well as other in-memory data, without the need to represent them in mmap-able files or use shmem. Similar to first class virtual address spaces, VAS segments must be created and destroyed explicitly by a user. The system will never automatically create or destroy a VAS segment. By attaching a VAS segment to a first class virtual address space, the memory contained in the VAS segment can be accessed and changed.

Signed-off-by: Till Smejkal <till.smej...@gmail.com>
Signed-off-by: Marco Benatto <marco.antonio@gmail.com>
---
 arch/x86/entry/syscalls/syscall_32.tbl |    7 +
 arch/x86/entry/syscalls/syscall_64.tbl |    7 +
 include/linux/syscalls.h               |   10 +
 include/linux/vas.h                    |  114 +++
 include/linux/vas_types.h              |   91 ++-
 include/uapi/asm-generic/unistd.h      |   16 +-
 include/uapi/linux/vas.h               |   12 +
 kernel/sys_ni.c                        |    7 +
 mm/vas.c                               | 1234 ++--
 9 files changed, 1451 insertions(+), 47 deletions(-)

diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl
index 8c553eef8c44..a4f91d14a856 100644
--- a/arch/x86/entry/syscalls/syscall_32.tbl
+++ b/arch/x86/entry/syscalls/syscall_32.tbl
@@ -398,3 +398,10 @@
 389	i386	active_vas		sys_active_vas
 390	i386	vas_getattr		sys_vas_getattr
 391	i386	vas_setattr		sys_vas_setattr
+392	i386	vas_seg_create		sys_vas_seg_create
+393	i386	vas_seg_delete		sys_vas_seg_delete
+394	i386	vas_seg_find		sys_vas_seg_find
+395	i386	vas_seg_attach		sys_vas_seg_attach
+396	i386	vas_seg_detach		sys_vas_seg_detach
+397	i386	vas_seg_getattr		sys_vas_seg_getattr
+398	i386	vas_seg_setattr		sys_vas_seg_setattr
diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
index 72f1f0495710..a0f9503c3d28 100644
--- a/arch/x86/entry/syscalls/syscall_64.tbl
+++ b/arch/x86/entry/syscalls/syscall_64.tbl
@@ -347,6 +347,13 @@
 338	common	active_vas		sys_active_vas
 339	common	vas_getattr		sys_vas_getattr
 340	common	vas_setattr		sys_vas_setattr
+341	common	vas_seg_create		sys_vas_seg_create
+342	common	vas_seg_delete		sys_vas_seg_delete
+343	common	vas_seg_find		sys_vas_seg_find
+344	common	vas_seg_attach		sys_vas_seg_attach
+345	common	vas_seg_detach		sys_vas_seg_detach
+346	common	vas_seg_getattr		sys_vas_seg_getattr
+347	common	vas_seg_setattr		sys_vas_seg_setattr
 
 #
 # x32-specific system call numbers start at 512 to avoid cache impact
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index fdea27d37c96..7380dcdc4bc1 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -66,6 +66,7 @@ struct perf_event_attr;
 struct file_handle;
 struct sigaltstack;
 struct vas_attr;
+struct vas_seg_attr;
 union bpf_attr;
 
 #include
@@ -914,4 +915,13 @@ asmlinkage long sys_active_vas(void);
 asmlinkage long sys_vas_getattr(int vid, struct vas_attr __user *attr);
 asmlinkage long sys_vas_setattr(int vid, struct vas_attr __user *attr);
+asmlinkage long sys_vas_seg_create(const char __user *name, unsigned long start,
+				   unsigned long end, umode_t mode);
+asmlinkage long sys_vas_seg_delete(int sid);
+asmlinkage long sys_vas_seg_find(const char __user *name);
+asmlinkage long sys_vas_seg_attach(int vid, int sid, int type);
+asmlinkage long sys_vas_seg_detach(int vid, int sid);
+asmlinkage long sys_vas_seg_getattr(int sid, struct vas_seg_attr __user *attr);
+asmlinkage long sys_vas_seg_setattr(int sid, struct vas_seg_attr __user *attr);
+
 #endif
diff --git a/include/linux/vas.h b/include/linux/vas.h
index 6a72e42f96d2..376b9fa1ee27 100644
--- a/include/linux/vas.h
+++ b/include/linux/vas.h
@@ -138,6 +138,120 @@ extern int vas_setattr(int vid, struct vas_attr *attr);
 
 /***
+ * Management of VAS segments
+ ***/
+
+/**
+ * Lock and unlock helper for VAS segments.
+ **/
+#define vas_seg_lock(seg) mutex_lock(&(seg)->mtx)
+#define vas_seg_unlock(seg) mutex_unlock(&(seg)->mtx)
+
+/**
+ * Create a new VAS segment.
+ *
+ * @param[in] name:	The name of the new VAS segment.
+ * @param[in] start:	The address where the VAS segment begins.
+ * @param[in] end:	The address where the VAS segment ends.
+ * @param[in] mode:	The access
[RFC PATCH 07/13] kernel/fork: Split and export 'mm_alloc' and 'mm_init'
Until now, the only way to create a new memory map was via the exported function 'mm_alloc'. Unfortunately, this function not only allocates a new memory map but also completely initializes it. However, with the introduction of first class virtual address spaces, some initialization steps done in 'mm_alloc' are not applicable to the memory maps needed for this feature and would lead to errors in the kernel code. Instead of introducing a new function that can allocate and initialize memory maps for first class virtual address spaces and thereby potentially duplicate code, I decided to split the 'mm_alloc' function as well as the 'mm_init' function that it uses. There are now four exported functions instead of one. The new 'mm_alloc' function only allocates a new mm_struct and zeros it out. To get the old behavior of 'mm_alloc', one can use the newly introduced function 'mm_alloc_and_setup', which not only allocates a new mm_struct but also fully initializes it. The old 'mm_init' function, which fully initialized an mm_struct, was split into two separate functions. The first one, 'mm_setup', does all the initialization of the mm_struct that is not related to a particular task. The task-related part of the initialization is done in the 'mm_set_task' function. This way it is possible to create memory maps that don't carry any task-specific information, as needed by the first class virtual address space feature. Both 'mm_setup' and 'mm_set_task' are also exported so that they can be used in all files in the source tree.
Signed-off-by: Till Smejkal <till.smej...@gmail.com>
---
 arch/arm/mach-rpc/ecard.c |  2 +-
 fs/exec.c                 |  2 +-
 include/linux/sched.h     |  7 +-
 kernel/fork.c             | 64 +--
 4 files changed, 59 insertions(+), 16 deletions(-)

diff --git a/arch/arm/mach-rpc/ecard.c b/arch/arm/mach-rpc/ecard.c
index dc67a7fb3831..15845e8abd7e 100644
--- a/arch/arm/mach-rpc/ecard.c
+++ b/arch/arm/mach-rpc/ecard.c
@@ -245,7 +245,7 @@ static void ecard_init_pgtables(struct mm_struct *mm)
 
 static int ecard_init_mm(void)
 {
-	struct mm_struct * mm = mm_alloc();
+	struct mm_struct *mm = mm_alloc_and_setup();
 	struct mm_struct *active_mm = current->active_mm;
 
 	if (!mm)
diff --git a/fs/exec.c b/fs/exec.c
index e57946610733..68d7908a1e5a 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -380,7 +380,7 @@ static int bprm_mm_init(struct linux_binprm *bprm)
 	int err;
 	struct mm_struct *mm = NULL;
 
-	bprm->mm = mm = mm_alloc();
+	bprm->mm = mm = mm_alloc_and_setup();
 	err = -ENOMEM;
 	if (!mm)
 		goto err;
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 42b9b93a50ac..7955adc00397 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2922,7 +2922,12 @@ static inline unsigned long sigsp(unsigned long sp, struct ksignal *ksig)
 /*
  * Routines for handling mm_structs
  */
-extern struct mm_struct * mm_alloc(void);
+extern struct mm_struct *mm_setup(struct mm_struct *mm);
+extern struct mm_struct *mm_set_task(struct mm_struct *mm,
+				     struct task_struct *p,
+				     struct user_namespace *user_ns);
+extern struct mm_struct *mm_alloc(void);
+extern struct mm_struct *mm_alloc_and_setup(void);
 
 /* mmdrop drops the mm and the page tables */
 extern void __mmdrop(struct mm_struct *);
diff --git a/kernel/fork.c b/kernel/fork.c
index 11c5c8ab827c..9209f6d5d7c0 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -747,8 +747,10 @@ static void mm_init_owner(struct mm_struct *mm, struct task_struct *p)
 #endif
 }
 
-static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p,
-				 struct user_namespace *user_ns)
+/**
+ * Initialize all the task-unrelated fields of a mm_struct.
+ **/
+struct mm_struct *mm_setup(struct mm_struct *mm)
 {
 	mm->mmap = NULL;
 	mm->mm_rb = RB_ROOT;
@@ -767,24 +769,37 @@ static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p,
 	spin_lock_init(&mm->page_table_lock);
 	mm_init_cpumask(mm);
 	mm_init_aio(mm);
-	mm_init_owner(mm, p);
 	mmu_notifier_mm_init(mm);
 	clear_tlb_flush_pending(mm);
 #if defined(CONFIG_TRANSPARENT_HUGEPAGE) && !USE_SPLIT_PMD_PTLOCKS
 	mm->pmd_huge_pte = NULL;
 #endif
+	mm->flags = default_dump_filter;
+	mm->def_flags = 0;
+
+	if (mm_alloc_pgd(mm))
+		goto fail_nopgd;
+
+	return mm;
+
+fail_nopgd:
+	free_mm(mm);
+	return NULL;
+}
+
+/**
+ * Initialize all the task-related fields of a mm_struct.
+ **/
+struct mm_struct *mm_set_task(struct mm_struct *mm, struct task_struct *p,
+			      struct user_namespace *user_ns)
+{
 	if (current->mm) {
 		mm->flags = current->mm->flags & M
[RFC PATCH 00/13] Introduce first class virtual address spaces
First class virtual address spaces (also called VAS) are a new functionality of the Linux kernel allowing address spaces to exist independently of processes. The general idea behind this feature is described in a paper at ASPLOS16 with the title 'SpaceJMP: Programming with Multiple Virtual Address Spaces' [1]. This patchset extends the kernel memory management subsystem with a new type of address space (called VAS) which can be created and destroyed independently of processes by a user in the system. During its lifetime, such a VAS can be attached to processes by the user, which allows a process to have multiple address spaces and thereby multiple, potentially different, views on the system's main memory. During execution, the threads belonging to the process are able to switch freely between the different attached VAS and the process' original AS, enabling them to utilize the different available views on memory. These multiple virtual address spaces per process, and the ability to switch between them freely, can be used in multiple interesting ways, as outlined in the mentioned paper. Some of the many possible applications are, for example, compartmentalizing a process for security reasons, improving the performance of data-centric applications, and introducing new application models [1]. In addition to the concept of first class virtual address spaces, this patchset introduces another feature called VAS segments. VAS segments are memory regions which have a fixed size and position in the virtual address space and can be shared between multiple first class virtual address spaces. Such shareable memory regions are especially useful for in-memory pointer-based data structures and other pure in-memory data.
First class virtual address spaces have a significant advantage over forking a process and using inter-process communication mechanisms, namely that creating and switching between VAS is significantly faster than creating and switching between processes. As can be seen in the following table, measured on an Intel Xeon E5620 CPU at 2.40GHz, creating a VAS is about 7 times faster than forking, and switching between VAS is up to 4 times faster than switching between processes.

         |   VAS   | processes |
  -------+---------+-----------+
  switch |   468ns |    1944ns |
  create | 20003ns |  150491ns |

Hence, first class virtual address spaces provide a fast mechanism for applications to utilize multiple virtual address spaces in parallel, with higher performance than splitting the application into multiple independent processes.

Both VAS and VAS segments have another significant advantage when combined with non-volatile memory. Because their life cycle is independent of processes and other kernel data structures, they can be used to save special memory regions, or even whole AS, in non-volatile memory, making it possible to reuse them across multiple system reboots.

At the current state of development, first class virtual address spaces have one limitation that we haven't been able to solve so far. The feature allows different threads of the same process to execute in different AS at the same time. This is possible because the VAS-switch operation only changes the active mm_struct for the task_struct of the calling thread. However, when a thread switches into a first class virtual address space, some parts of its original AS are duplicated into the new one to allow the thread to continue its execution at its current state. Accordingly, parts of the process's AS (e.g. the code, data, heap and stack sections) exist in multiple AS if the process has a VAS attached to it.
Changes to these shared memory regions are synchronized between the address spaces whenever a thread switches between two of them. Unfortunately, in some scenarios the kernel is not able to properly synchronize all these shared memory regions because of conflicting changes. One such example happens if there are two threads, one executing in an attached first class virtual address space and the other in the task's original address space. If both threads make changes to the heap section that cause expansion of the underlying vm_area_struct, the kernel cannot correctly synchronize these changes, because doing so would cause parts of the virtual address space to be overwritten with unrelated data. In the current implementation such conflicts are only detected, not resolved, and result in an error code being returned by the kernel during the VAS-switch operation. Unfortunately, this means that the particular thread that tried to make the switch cannot do so anymore in the future and accordingly has to be killed.

This code was developed during an internship at Hewlett Packard Enterprise.

[1] http://impact.crhc.illinois.edu/shared/Papers/ASPLOS16-SpaceJMP.pdf

Till Smejkal (13):
  mm: Add mm_struct argument to 'mmap_region'
  mm: Add
[RFC PATCH 01/13] mm: Add mm_struct argument to 'mmap_region'
Add to the 'mmap_region' function the mm_struct it should operate on as an additional argument. Previously, the function simply used the memory map of the current task. However, with the introduction of first class virtual address spaces, mmap_region also needs to be able to operate on memory maps other than the current task's. With the new argument, the caller now explicitly defines which memory map to use.

Signed-off-by: Till Smejkal <till.smej...@gmail.com>
---
 arch/mips/kernel/vdso.c |  2 +-
 arch/tile/mm/elf.c      |  2 +-
 include/linux/mm.h      |  5 +++--
 mm/mmap.c               | 10 +-
 4 files changed, 10 insertions(+), 9 deletions(-)

diff --git a/arch/mips/kernel/vdso.c b/arch/mips/kernel/vdso.c
index f9dbfb14af33..9631b42908f3 100644
--- a/arch/mips/kernel/vdso.c
+++ b/arch/mips/kernel/vdso.c
@@ -108,7 +108,7 @@ int arch_setup_additional_pages(struct linux_binprm *bprm, int uses_interp)
 		return -EINTR;
 
 	/* Map delay slot emulation page */
-	base = mmap_region(NULL, STACK_TOP, PAGE_SIZE,
+	base = mmap_region(mm, NULL, STACK_TOP, PAGE_SIZE,
 			   VM_READ|VM_WRITE|VM_EXEC|
 			   VM_MAYREAD|VM_MAYWRITE|VM_MAYEXEC, 0);
diff --git a/arch/tile/mm/elf.c b/arch/tile/mm/elf.c
index 6225cc998db1..a22768059b7a 100644
--- a/arch/tile/mm/elf.c
+++ b/arch/tile/mm/elf.c
@@ -141,7 +141,7 @@ int arch_setup_additional_pages(struct linux_binprm *bprm,
 	 */
 	if (!retval) {
 		unsigned long addr = MEM_USER_INTRPT;
-		addr = mmap_region(NULL, addr, INTRPT_SIZE,
+		addr = mmap_region(mm, NULL, addr, INTRPT_SIZE,
 				   VM_READ|VM_EXEC|
 				   VM_MAYREAD|VM_MAYWRITE|VM_MAYEXEC, 0);
 		if (addr > (unsigned long) -PAGE_SIZE)
diff --git a/include/linux/mm.h b/include/linux/mm.h
index b84615b0f64c..fa483d2ff3eb 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2016,8 +2016,9 @@ extern int install_special_mapping(struct mm_struct *mm,
 extern unsigned long get_unmapped_area(struct file *, unsigned long, unsigned long, unsigned long, unsigned long);
 
-extern unsigned long mmap_region(struct file *file, unsigned long addr,
-	unsigned long len, vm_flags_t vm_flags, unsigned long pgoff);
+extern unsigned long mmap_region(struct mm_struct *mm, struct file *file,
+				 unsigned long addr, unsigned long len,
+				 vm_flags_t vm_flags, unsigned long pgoff);
 extern unsigned long do_mmap(struct file *file, unsigned long addr,
 	unsigned long len, unsigned long prot, unsigned long flags,
 	vm_flags_t vm_flags, unsigned long pgoff, unsigned long *populate);
diff --git a/mm/mmap.c b/mm/mmap.c
index dc4291dcc99b..5ac276ac9807 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -1447,7 +1447,7 @@ unsigned long do_mmap(struct file *file, unsigned long addr,
 			vm_flags |= VM_NORESERVE;
 	}
 
-	addr = mmap_region(file, addr, len, vm_flags, pgoff);
+	addr = mmap_region(mm, file, addr, len, vm_flags, pgoff);
 	if (!IS_ERR_VALUE(addr) &&
 	    ((vm_flags & VM_LOCKED) ||
 	     (flags & (MAP_POPULATE | MAP_NONBLOCK)) == MAP_POPULATE))
@@ -1582,10 +1582,10 @@ static inline int accountable_mapping(struct file *file, vm_flags_t vm_flags)
 	return (vm_flags & (VM_NORESERVE | VM_SHARED | VM_WRITE)) == VM_WRITE;
 }
 
-unsigned long mmap_region(struct file *file, unsigned long addr,
-		unsigned long len, vm_flags_t vm_flags, unsigned long pgoff)
+unsigned long mmap_region(struct mm_struct *mm, struct file *file,
+		unsigned long addr, unsigned long len, vm_flags_t vm_flags,
+		unsigned long pgoff)
 {
-	struct mm_struct *mm = current->mm;
 	struct vm_area_struct *vma, *prev;
 	int error;
 	struct rb_node **rb_link, *rb_parent;
@@ -1704,7 +1704,7 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
 	vm_stat_account(mm, vm_flags, len >> PAGE_SHIFT);
 	if (vm_flags & VM_LOCKED) {
 		if (!((vm_flags & VM_SPECIAL) || is_vm_hugetlb_page(vma) ||
-					vma == get_gate_vma(current->mm)))
+					vma == get_gate_vma(mm)))
 			mm->locked_vm += (len >> PAGE_SHIFT);
 		else
 			vma->vm_flags &= VM_LOCKED_CLEAR_MASK;
-- 
2.12.0
[RFC PATCH 03/13] mm: Rename 'unmap_region' and add mm_struct argument
Rename the 'unmap_region' function to 'munmap_region' so that it follows the same naming pattern as the do_mmap <-> mmap_region couple, and make the new 'munmap_region' function publicly available to all other kernel sources. In addition, add the mm_struct the function should operate on as an explicit argument. Previously, the function simply used the memory map of the current task. However, with the introduction of first class virtual address spaces, munmap_region also needs to be able to operate on memory maps other than just the current task's. Accordingly, the new argument lets the caller define explicitly which memory map should be used.

Signed-off-by: Till Smejkal <till.smej...@gmail.com>
---
 include/linux/mm.h |  4
 mm/mmap.c          | 14 +-
 2 files changed, 9 insertions(+), 9 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index fb11be77545f..71a90604d21f 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2023,6 +2023,10 @@ extern unsigned long do_mmap(struct mm_struct *mm, struct file *file,
 	unsigned long addr, unsigned long len, unsigned long prot,
 	unsigned long flags, vm_flags_t vm_flags, unsigned long pgoff,
 	unsigned long *populate);
+
+extern void munmap_region(struct mm_struct *mm, struct vm_area_struct *vma,
+			  struct vm_area_struct *prev, unsigned long start,
+			  unsigned long end);
 extern int do_munmap(struct mm_struct *, unsigned long, size_t);
 
 static inline unsigned long
diff --git a/mm/mmap.c b/mm/mmap.c
index 70028bf7b58d..ea79bc4da5b7 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -70,10 +70,6 @@ int mmap_rnd_compat_bits __read_mostly = CONFIG_ARCH_MMAP_RND_COMPAT_BITS;
 static bool ignore_rlimit_data;
 core_param(ignore_rlimit_data, ignore_rlimit_data, bool, 0644);
 
-static void unmap_region(struct mm_struct *mm,
-		struct vm_area_struct *vma, struct vm_area_struct *prev,
-		unsigned long start, unsigned long end);
-
 /* description of effects of mapping type and prot in current implementation.
  * this is due to the limited x86 page protection hardware.  The expected
  * behavior is in parens:
@@ -1731,7 +1727,7 @@ unsigned long mmap_region(struct mm_struct *mm, struct file *file,
 		fput(file);
 
 	/* Undo any partial mapping done by a device driver. */
-	unmap_region(mm, vma, prev, vma->vm_start, vma->vm_end);
+	munmap_region(mm, vma, prev, vma->vm_start, vma->vm_end);
 	charged = 0;
 	if (vm_flags & VM_SHARED)
 		mapping_unmap_writable(file->f_mapping);
@@ -2447,9 +2443,9 @@ static void remove_vma_list(struct mm_struct *mm, struct vm_area_struct *vma)
  *
  * Called with the mm semaphore held.
  */
-static void unmap_region(struct mm_struct *mm,
-		struct vm_area_struct *vma, struct vm_area_struct *prev,
-		unsigned long start, unsigned long end)
+void munmap_region(struct mm_struct *mm, struct vm_area_struct *vma,
+		   struct vm_area_struct *prev, unsigned long start,
+		   unsigned long end)
 {
 	struct vm_area_struct *next = prev ? prev->vm_next : mm->mmap;
 	struct mmu_gather tlb;
@@ -2654,7 +2650,7 @@ int do_munmap(struct mm_struct *mm, unsigned long start, size_t len)
 	 * Remove the vma's, and unmap the actual pages
 	 */
 	detach_vmas_to_be_unmapped(mm, vma, prev, end);
-	unmap_region(mm, vma, prev, start, end);
+	munmap_region(mm, vma, prev, start, end);
 
 	arch_unmap(mm, vma, start, end);
-- 
2.12.0
[RFC PATCH 02/13] mm: Add mm_struct argument to 'do_mmap' and 'do_mmap_pgoff'
Add to the 'do_mmap' and 'do_mmap_pgoff' functions the mm_struct they should operate on as an additional argument. Previously, both functions simply used the memory map of the current task. However, with the introduction of first class virtual address spaces, these functions also need to be usable with memory maps other than that of the current process. Hence, the caller now explicitly defines which memory map to use.

Signed-off-by: Till Smejkal <till.smej...@gmail.com>
---
 arch/x86/mm/mpx.c  |  4 ++--
 fs/aio.c           |  4 ++--
 include/linux/mm.h | 11 ++-
 ipc/shm.c          |  3 ++-
 mm/mmap.c          | 16
 mm/nommu.c         |  7 ---
 mm/util.c          |  2 +-
 7 files changed, 25 insertions(+), 22 deletions(-)

diff --git a/arch/x86/mm/mpx.c b/arch/x86/mm/mpx.c
index af59f808742f..99c664a97c35 100644
--- a/arch/x86/mm/mpx.c
+++ b/arch/x86/mm/mpx.c
@@ -50,8 +50,8 @@ static unsigned long mpx_mmap(unsigned long len)
 		return -EINVAL;
 
 	down_write(&mm->mmap_sem);
-	addr = do_mmap(NULL, 0, len, PROT_READ | PROT_WRITE,
-			MAP_ANONYMOUS | MAP_PRIVATE, VM_MPX, 0, &populate);
+	addr = do_mmap(mm, NULL, 0, len, PROT_READ | PROT_WRITE,
+			MAP_ANONYMOUS | MAP_PRIVATE, VM_MPX, 0, &populate);
 	up_write(&mm->mmap_sem);
 	if (populate)
 		mm_populate(addr, populate);
diff --git a/fs/aio.c b/fs/aio.c
index 873b4ca82ccb..df9bba5a2aff 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -510,8 +510,8 @@ static int aio_setup_ring(struct kioctx *ctx)
 		return -EINTR;
 	}
 
-	ctx->mmap_base = do_mmap_pgoff(ctx->aio_ring_file, 0, ctx->mmap_size,
-				       PROT_READ | PROT_WRITE,
-				       MAP_SHARED, 0, &unused);
+	ctx->mmap_base = do_mmap_pgoff(current->mm, ctx->aio_ring_file, 0,
+				       ctx->mmap_size, PROT_READ | PROT_WRITE,
+				       MAP_SHARED, 0, &unused);
 	up_write(&mm->mmap_sem);
 	if (IS_ERR((void *)ctx->mmap_base)) {
diff --git a/include/linux/mm.h b/include/linux/mm.h
index fa483d2ff3eb..fb11be77545f 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2019,17 +2019,18 @@ extern unsigned long get_unmapped_area(struct file *, unsigned long, unsigned long, unsigned long, unsigned long);
 extern unsigned long mmap_region(struct mm_struct *mm, struct file *file,
 	unsigned long addr, unsigned long len, vm_flags_t vm_flags,
 	unsigned long pgoff);
-extern unsigned long do_mmap(struct file *file, unsigned long addr,
-	unsigned long len, unsigned long prot, unsigned long flags,
-	vm_flags_t vm_flags, unsigned long pgoff, unsigned long *populate);
+extern unsigned long do_mmap(struct mm_struct *mm, struct file *file,
+	unsigned long addr, unsigned long len, unsigned long prot,
+	unsigned long flags, vm_flags_t vm_flags, unsigned long pgoff,
+	unsigned long *populate);
 extern int do_munmap(struct mm_struct *, unsigned long, size_t);
 
 static inline unsigned long
-do_mmap_pgoff(struct file *file, unsigned long addr,
+do_mmap_pgoff(struct mm_struct *mm, struct file *file, unsigned long addr,
 	unsigned long len, unsigned long prot, unsigned long flags,
 	unsigned long pgoff, unsigned long *populate)
 {
-	return do_mmap(file, addr, len, prot, flags, 0, pgoff, populate);
+	return do_mmap(mm, file, addr, len, prot, flags, 0, pgoff, populate);
 }
 
 #ifdef CONFIG_MMU
diff --git a/ipc/shm.c b/ipc/shm.c
index 81203e8ba013..64c21fb32ca9 100644
--- a/ipc/shm.c
+++ b/ipc/shm.c
@@ -1222,7 +1222,8 @@ long do_shmat(int shmid, char __user *shmaddr, int shmflg, ulong *raddr,
 			goto invalid;
 	}
 
-	addr = do_mmap_pgoff(file, addr, size, prot, flags, 0, &populate);
+	addr = do_mmap_pgoff(mm, file, addr, size, prot, flags, 0,
+			     &populate);
 	*raddr = addr;
 	err = 0;
 	if (IS_ERR_VALUE(addr))
diff --git a/mm/mmap.c b/mm/mmap.c
index 5ac276ac9807..70028bf7b58d 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -1299,14 +1299,14 @@ static inline int mlock_future_check(struct mm_struct *mm,
 }
 
 /*
- * The caller must hold down_write(&current->mm->mmap_sem).
+ * The caller must hold down_write(&mm->mmap_sem).
 */
-unsigned long do_mmap(struct file *file, unsigned long addr,
-			unsigned long len, unsigned long prot,
-			unsigned long flags, vm_flags_t vm_flags,
-			unsigned long pgoff, unsigned long *populate)
+unsigned long do_mmap(struct mm_struct *mm, struct file *file,
+			unsigned long addr, unsigned long len,
+			unsigned long prot, unsigned long flags,
+			vm_flags_t vm_flags, unsigned long pgoff,
+			unsigned long *populate)
 {
-	struct mm_struct *mm = current->mm;
 	int pkey = 0;
 
 	*populate = 0;
@@