Re: [RFC PATCH 00/13] Introduce first class virtual address spaces
On Thu, 16 Mar 2017, Thomas Gleixner wrote:
> Why do we need yet another mechanism to represent something which looks
> like a file instead of simply using existing mechanisms and extend them?

You are right. I also recognized during the discussion with Andy, Chris,
Matthew, Luck, Rich and the others that there are already other techniques in
the Linux kernel that can achieve the same functionality when combined. As I
said to the others as well, I will drop the VAS segments for future versions.
The first class virtual address space feature was the more interesting part
of the patchset anyway.

Thanks
Till
Re: [RFC PATCH 00/13] Introduce first class virtual address spaces
On Thu, 16 Mar 2017, Thomas Gleixner wrote:
> On Thu, 16 Mar 2017, Till Smejkal wrote:
> > On Thu, 16 Mar 2017, Thomas Gleixner wrote:
> > > Why do we need yet another mechanism to represent something which looks
> > > like a file instead of simply using existing mechanisms and extend them?
> >
> > You are right. I also recognized during the discussion with Andy, Chris,
> > Matthew, Luck, Rich and the others that there are already other
> > techniques in the Linux kernel that can achieve the same functionality
> > when combined. As I said to the others as well, I will drop the VAS
> > segments for future versions. The first class virtual address space
> > feature was the more interesting part of the patchset anyway.
>
> While you are at it, could you please drop this 'first class' marketing as
> well? It has zero technical value, really.

Yes, of course. I am sorry for the trouble I have already caused.

Thanks
Till
Re: [RFC PATCH 00/13] Introduce first class virtual address spaces
On Thu, 16 Mar 2017, Till Smejkal wrote:
> On Thu, 16 Mar 2017, Thomas Gleixner wrote:
> > Why do we need yet another mechanism to represent something which looks
> > like a file instead of simply using existing mechanisms and extend them?
>
> You are right. I also recognized during the discussion with Andy, Chris,
> Matthew, Luck, Rich and the others that there are already other techniques
> in the Linux kernel that can achieve the same functionality when combined.
> As I said also to the others, I will drop the VAS segments for future
> versions. The first class virtual address space feature was the more
> interesting part of the patchset anyways.

While you are at it, could you please drop this 'first class' marketing as
well? It has zero technical value, really.

Thanks,

	tglx
Re: [RFC PATCH 00/13] Introduce first class virtual address spaces
On Wed, 15 Mar 2017, Till Smejkal wrote:
> On Wed, 15 Mar 2017, Andy Lutomirski wrote:
> > > VAS segments on the other side would provide a functionality to
> > > achieve the same without the need of any mounted filesystem. However,
> > > I agree, that this is just a small advantage compared to what can
> > > already be achieved with the existing functionality provided by the
> > > Linux kernel.
> >
> > I see this "small advantage" as "resource leak and security problem".
>
> I don't agree here. VAS segments are basically in-memory files that are
> handled by the kernel directly without using a file system. Hence, if an

Why do we need yet another mechanism to represent something which looks
like a file instead of simply using existing mechanisms and extend them?

Thanks,

	tglx
Re: [RFC PATCH 00/13] Introduce first class virtual address spaces
On Wed, 15 Mar 2017, Luck, Tony wrote:
> On Wed, Mar 15, 2017 at 03:02:34PM -0700, Till Smejkal wrote:
> > I don't agree here. VAS segments are basically in-memory files that are
> > handled by the kernel directly without using a file system. Hence, if an
> > application uses a VAS segment to store data the same rules apply as if
> > it uses a file. Everything that it saves in the VAS segment might be
> > accessible by other applications. An application using VAS segments
> > should be aware of this fact. In addition, the resources that are
> > represented by a VAS segment are not leaked. As I said, VAS segments are
> > much like files. Hence, if you don't want to use them any more, delete
> > them. But as with files, the kernel will not delete them for you
> > (although something like this can be added).
>
> So how do they differ from shmget(2), shmat(2), shmdt(2), shmctl(2)?
>
> Apart from VAS having better names, instead of silly "key_t key" ones.

Unfortunately, I have to admit that the VAS segments don't differ from shm* a
lot. The implementation is different, but the functionality that you can
achieve with it is very similar. I am sorry. We should have looked more
closely at the whole functionality that is provided by the shmem subsystem
before working on VAS segments.

However, VAS segments are not the key part of this patchset. The more
interesting functionality in our opinion is the introduction of first class
virtual address spaces and what they can be used for. VAS segments were just
another logical step for us (from first class virtual address spaces to first
class virtual address space segments), but since their functionality can be
achieved with various other already existing features of the Linux kernel, I
will probably drop them in future versions of the patchset.

Thanks
Till
Re: [RFC PATCH 00/13] Introduce first class virtual address spaces
On Wed, Mar 15, 2017 at 03:02:34PM -0700, Till Smejkal wrote:
> I don't agree here. VAS segments are basically in-memory files that are
> handled by the kernel directly without using a file system. Hence, if an
> application uses a VAS segment to store data the same rules apply as if it
> uses a file. Everything that it saves in the VAS segment might be
> accessible by other applications. An application using VAS segments should
> be aware of this fact. In addition, the resources that are represented by
> a VAS segment are not leaked. As I said, VAS segments are much like files.
> Hence, if you don't want to use them any more, delete them. But as with
> files, the kernel will not delete them for you (although something like
> this can be added).

So how do they differ from shmget(2), shmat(2), shmdt(2), shmctl(2)?

Apart from VAS having better names, instead of silly "key_t key" ones.

-Tony
Re: [RFC PATCH 00/13] Introduce first class virtual address spaces
On Wed, 15 Mar 2017, Andy Lutomirski wrote:
> On Wed, Mar 15, 2017 at 12:44 PM, Till Smejkal wrote:
> > On Wed, 15 Mar 2017, Andy Lutomirski wrote:
> > > > One advantage of VAS segments is that they can be globally queried
> > > > by user programs, which means that VAS segments can be shared by
> > > > applications that do not necessarily have to be related. If I am not
> > > > mistaken, MAP_SHARED of pure in-memory data will only work if the
> > > > tasks that share the memory region are related (aka. have a common
> > > > parent that initialized the shared mapping). Otherwise, the shared
> > > > mapping has to be backed by a file.
> > >
> > > What's wrong with memfd_create()?
> > >
> > > > VAS segments on the other side allow sharing of pure in-memory data
> > > > by arbitrary, not necessarily related tasks without the need of a
> > > > file. This becomes especially interesting if one combines VAS
> > > > segments with non-volatile memory since one can keep data structures
> > > > in the NVM and still be able to share them between multiple tasks.
> > >
> > > What's wrong with regular mmap?
> >
> > I never wanted to say that there is something wrong with regular mmap.
> > We just figured that with VAS segments you could remove the need to mmap
> > your shared data and instead keep everything purely in memory.
>
> memfd does that.

Yes, that's right. Thanks for giving me the pointer to this. I should have
researched more carefully before starting to work on VAS segments.

> > VAS segments on the other side would provide a functionality to achieve
> > the same without the need of any mounted filesystem. However, I agree
> > that this is just a small advantage compared to what can already be
> > achieved with the existing functionality provided by the Linux kernel.
>
> I see this "small advantage" as "resource leak and security problem".

I don't agree here. VAS segments are basically in-memory files that are
handled by the kernel directly without using a file system. Hence, if an
application uses a VAS segment to store data, the same rules apply as if it
uses a file. Everything that it saves in the VAS segment might be accessible
by other applications. An application using VAS segments should be aware of
this fact. In addition, the resources that are represented by a VAS segment
are not leaked. As I said, VAS segments are much like files. Hence, if you
don't want to use them any more, delete them. But as with files, the kernel
will not delete them for you (although something like this can be added).

> > > This sounds complicated and fragile. What happens if a heuristically
> > > shared region coincides with a region in the "first class address
> > > space" being selected?
> >
> > If such a conflict happens, the task cannot use the first class address
> > space and the corresponding system call will return an error. However,
> > with the currently available virtual address space size that programs
> > can use, such conflicts are probably rare.
>
> A bug that hits 1% of the time is often worse than one that hits 100% of
> the time because debugging it is miserable.

I don't agree that this is a bug at all. If there is a conflict in the memory
layout of the ASes, the application simply cannot use this first class
virtual address space. Every application that wants to use first class
virtual address spaces should check for error return values and handle them.

This situation is similar to mapping a file at some special address in memory
because the file contains pointer-based data structures and the application
wants to use them, but the kernel cannot map the file at this particular
position in the application's AS because there is already a different
conflicting mapping. If an application wants to do such things, it should
also handle all the errors that can occur.

Till
Re: [RFC PATCH 00/13] Introduce first class virtual address spaces
On Wed, 15 Mar 2017, Rich Felker wrote:
> On Wed, Mar 15, 2017 at 12:44:47PM -0700, Till Smejkal wrote:
> > On Wed, 15 Mar 2017, Andy Lutomirski wrote:
> > > > One advantage of VAS segments is that they can be globally queried
> > > > by user programs, which means that VAS segments can be shared by
> > > > applications that do not necessarily have to be related. If I am not
> > > > mistaken, MAP_SHARED of pure in-memory data will only work if the
> > > > tasks that share the memory region are related (aka. have a common
> > > > parent that initialized the shared mapping). Otherwise, the shared
> > > > mapping has to be backed by a file.
> > >
> > > What's wrong with memfd_create()?
> > >
> > > > VAS segments on the other side allow sharing of pure in-memory data
> > > > by arbitrary, not necessarily related tasks without the need of a
> > > > file. This becomes especially interesting if one combines VAS
> > > > segments with non-volatile memory since one can keep data structures
> > > > in the NVM and still be able to share them between multiple tasks.
> > >
> > > What's wrong with regular mmap?
> >
> > I never wanted to say that there is something wrong with regular mmap.
> > We just figured that with VAS segments you could remove the need to mmap
> > your shared data and instead keep everything purely in memory.
> >
> > Unfortunately, I am not at full speed with memfds. Is my understanding
> > correct that if the last user of such a file descriptor closes it, the
> > corresponding memory is freed? Accordingly, memfd cannot be used to keep
> > data in memory while no program is currently using it, can it? To be
> > able to do this you need again some representation
>
> I have a name for application-allocated kernel resources that persist
> without a process holding a reference to them or a node in the filesystem:
> a bug. See: sysvipc.

VAS segments are first class citizens of the OS similar to processes.
Accordingly, I would not see this behavior as a bug. VAS segments are a
kernel handle to "persistent" memory (in the sense that they are independent
of the lifetime of the application that created them). That means the memory
that is described by VAS segments can be reused by other applications even if
the VAS segment was not used by any application in between. It is very much
like a pure in-memory file. An application creates a VAS segment, fills it
with content and, if it does not delete it again, can reuse/open it again
later. This also means that if you know you will never want to use this
memory again, you have to remove it explicitly, just like you have to remove
a file if you don't want to use it anymore.

I think it really might be better to implement VAS segments (if I should keep
this feature at all) with a special purpose filesystem. The way I've designed
it seems to be very misleading.

Till
Re: [RFC PATCH 00/13] Introduce first class virtual address spaces
On Wed, Mar 15, 2017 at 12:44 PM, Till Smejkal wrote:
> On Wed, 15 Mar 2017, Andy Lutomirski wrote:
> > > One advantage of VAS segments is that they can be globally queried by
> > > user programs, which means that VAS segments can be shared by
> > > applications that do not necessarily have to be related. If I am not
> > > mistaken, MAP_SHARED of pure in-memory data will only work if the
> > > tasks that share the memory region are related (aka. have a common
> > > parent that initialized the shared mapping). Otherwise, the shared
> > > mapping has to be backed by a file.
> >
> > What's wrong with memfd_create()?
> >
> > > VAS segments on the other side allow sharing of pure in-memory data by
> > > arbitrary, not necessarily related tasks without the need of a file.
> > > This becomes especially interesting if one combines VAS segments with
> > > non-volatile memory since one can keep data structures in the NVM and
> > > still be able to share them between multiple tasks.
> >
> > What's wrong with regular mmap?
>
> I never wanted to say that there is something wrong with regular mmap. We
> just figured that with VAS segments you could remove the need to mmap your
> shared data and instead keep everything purely in memory.

memfd does that.

> Unfortunately, I am not at full speed with memfds. Is my understanding
> correct that if the last user of such a file descriptor closes it, the
> corresponding memory is freed? Accordingly, memfd cannot be used to keep
> data in memory while no program is currently using it, can it?

No, stop right here. If you want to have a bunch of memory that outlives the
program that allocates it, use a filesystem (tmpfs, hugetlbfs, ext4,
whatever). Don't create new persistent kernel things.

> VAS segments on the other side would provide a functionality to achieve
> the same without the need of any mounted filesystem. However, I agree that
> this is just a small advantage compared to what can already be achieved
> with the existing functionality provided by the Linux kernel.

I see this "small advantage" as "resource leak and security problem".

> > This sounds complicated and fragile. What happens if a heuristically
> > shared region coincides with a region in the "first class address space"
> > being selected?
>
> If such a conflict happens, the task cannot use the first class address
> space and the corresponding system call will return an error. However,
> with the currently available virtual address space size that programs can
> use, such conflicts are probably rare.

A bug that hits 1% of the time is often worse than one that hits 100% of the
time because debugging it is miserable.

--Andy
Re: [RFC PATCH 00/13] Introduce first class virtual address spaces
On Wed, Mar 15, 2017 at 12:44:47PM -0700, Till Smejkal wrote:
> On Wed, 15 Mar 2017, Andy Lutomirski wrote:
> > > One advantage of VAS segments is that they can be globally queried by
> > > user programs, which means that VAS segments can be shared by
> > > applications that do not necessarily have to be related. If I am not
> > > mistaken, MAP_SHARED of pure in-memory data will only work if the
> > > tasks that share the memory region are related (aka. have a common
> > > parent that initialized the shared mapping). Otherwise, the shared
> > > mapping has to be backed by a file.
> >
> > What's wrong with memfd_create()?
> >
> > > VAS segments on the other side allow sharing of pure in-memory data by
> > > arbitrary, not necessarily related tasks without the need of a file.
> > > This becomes especially interesting if one combines VAS segments with
> > > non-volatile memory since one can keep data structures in the NVM and
> > > still be able to share them between multiple tasks.
> >
> > What's wrong with regular mmap?
>
> I never wanted to say that there is something wrong with regular mmap. We
> just figured that with VAS segments you could remove the need to mmap your
> shared data and instead keep everything purely in memory.
>
> Unfortunately, I am not at full speed with memfds. Is my understanding
> correct that if the last user of such a file descriptor closes it, the
> corresponding memory is freed? Accordingly, memfd cannot be used to keep
> data in memory while no program is currently using it, can it? To be able
> to do this you need again some representation

I have a name for application-allocated kernel resources that persist
without a process holding a reference to them or a node in the filesystem: a
bug. See: sysvipc.

Rich
Re: [RFC PATCH 00/13] Introduce first class virtual address spaces
On Wed, 15 Mar 2017, Andy Lutomirski wrote:
> > One advantage of VAS segments is that they can be globally queried by
> > user programs, which means that VAS segments can be shared by
> > applications that do not necessarily have to be related. If I am not
> > mistaken, MAP_SHARED of pure in-memory data will only work if the tasks
> > that share the memory region are related (aka. have a common parent that
> > initialized the shared mapping). Otherwise, the shared mapping has to be
> > backed by a file.
>
> What's wrong with memfd_create()?
>
> > VAS segments on the other side allow sharing of pure in-memory data by
> > arbitrary, not necessarily related tasks without the need of a file.
> > This becomes especially interesting if one combines VAS segments with
> > non-volatile memory since one can keep data structures in the NVM and
> > still be able to share them between multiple tasks.
>
> What's wrong with regular mmap?

I never wanted to say that there is something wrong with regular mmap. We
just figured that with VAS segments you could remove the need to mmap your
shared data and instead keep everything purely in memory.

Unfortunately, I am not at full speed with memfds. Is my understanding
correct that if the last user of such a file descriptor closes it, the
corresponding memory is freed? Accordingly, memfd cannot be used to keep
data in memory while no program is currently using it, can it? To be able to
do this you need again some representation of the data in a file? Yes, you
can use a tmpfs to keep the file content in memory as well, or some DAX
filesystem to keep the file content in NVM, but this always requires that
such filesystems are mounted in the system that the application is currently
running on. VAS segments on the other side would provide a functionality to
achieve the same without the need of any mounted filesystem. However, I
agree that this is just a small advantage compared to what can already be
achieved with the existing functionality provided by the Linux kernel.

I probably need to revisit the whole idea of first class virtual address
space segments before continuing with this patchset. Thank you very much for
the great feedback.

> > > > > Ick. Please don't do this. Can we please keep an mm as just an mm
> > > > > and not make it look magically different depending on which
> > > > > process maps it? If you need a trampoline (which you do, of
> > > > > course), just write a trampoline in regular user code and map it
> > > > > manually.
> > > >
> > > > Did I understand you correctly that you are proposing that the
> > > > switching thread should make sure by itself that its code, stack, …
> > > > memory regions are properly set up in the new AS before/after
> > > > switching into it? I think this would make using first class virtual
> > > > address spaces much more difficult for user applications, to the
> > > > extent that I am not even sure if they can be used at all. At the
> > > > moment, switching into a VAS is a very simple operation for an
> > > > application because the kernel will just simply do the right thing.
> > >
> > > Yes. I think that having the same mm_struct look different from
> > > different tasks is problematic. Getting it right in the arch code is
> > > going to be nasty. The heuristics of what to share are also tough --
> > > why would text + data + stack or whatever you're doing be adequate?
> > > What if you're in a thread? What if two tasks have their stacks in
> > > the same place?
> >
> > The different ASes that a task now can have when it uses first class
> > virtual address spaces are not realized in the kernel by using only one
> > mm_struct per task that just looks different, but by using multiple
> > mm_structs - one for each AS that the task can execute in. When a task
> > attaches a first class virtual address space to itself to be able to use
> > another AS, the kernel adds a temporary mm_struct to this task that
> > contains the mappings of the first class virtual address space and the
> > ones shared with the task's original AS. If a thread now wants to switch
> > into this attached first class virtual address space, the kernel only
> > changes the 'mm' and 'active_mm' pointers in the task_struct of the
> > thread to the temporary mm_struct and performs the corresponding
> > mm_switch operation. The original mm_struct of the thread will not be
> > changed.
> >
> > Accordingly, I do not magically make mm_structs look different depending
> > on the task that uses it, but create temporary mm_structs that only
> > contain mappings to the same memory regions.
>
> This sounds complicated and fragile. What happens if a heuristically
> shared region coincides with a region in the "first class address space"
> being selected?

If such a conflict happens, the task cannot use the first class address
space and the corresponding system call will return an error. However, with
the currently available virtual address space size that programs can use,
such conflicts are probably rare.

Till
Re: [RFC PATCH 00/13] Introduce first class virtual address spaces
On Wed, Mar 15, 2017 at 09:51:31AM -0700, Andy Lutomirski wrote:
> > VAS segments on the other side allow sharing of pure in-memory data by
> > arbitrary, not necessarily related tasks without the need of a file.
> > This becomes especially interesting if one combines VAS segments with
> > non-volatile memory since one can keep data structures in the NVM and
> > still be able to share them between multiple tasks.
>
> What's wrong with regular mmap?

I think it's the usual misunderstandings about how to use mmap.

From the paper:

	Memory-centric computing demands careful organization of the
	virtual address space, but interfaces such as mmap only give
	limited control. Some systems do not support creation of address
	regions at specific offsets. In Linux, for example, mmap does not
	safely abort if a request is made to open a region of memory over
	an existing region; it simply writes over it.

The correct answer, of course, is "Don't specify MAP_FIXED". Specify the
'hint' address, and if you don't get it, either fix up your data structure
pointers, or just abort and complain noisily.
Re: [RFC PATCH 00/13] Introduce first class virtual address spaces
On Tue, Mar 14, 2017 at 9:12 AM, Till Smejkal wrote:
> On Mon, 13 Mar 2017, Andy Lutomirski wrote:
> > On Mon, Mar 13, 2017 at 7:07 PM, Till Smejkal wrote:
> > > On Mon, 13 Mar 2017, Andy Lutomirski wrote:
> > > > This sounds rather complicated. Getting TLB flushing right seems
> > > > tricky. Why not just map the same thing into multiple mms?
> > >
> > > This is exactly what happens at the end. The memory region that is
> > > described by the VAS segment will be mapped in the ASes that use the
> > > segment.
> >
> > So why is this kernel feature better than just doing MAP_SHARED
> > manually in userspace?
>
> One advantage of VAS segments is that they can be globally queried by user
> programs, which means that VAS segments can be shared by applications that
> do not necessarily have to be related. If I am not mistaken, MAP_SHARED of
> pure in-memory data will only work if the tasks that share the memory
> region are related (aka. have a common parent that initialized the shared
> mapping). Otherwise, the shared mapping has to be backed by a file.

What's wrong with memfd_create()?

> VAS segments on the other side allow sharing of pure in-memory data by
> arbitrary, not necessarily related tasks without the need of a file. This
> becomes especially interesting if one combines VAS segments with
> non-volatile memory since one can keep data structures in the NVM and
> still be able to share them between multiple tasks.

What's wrong with regular mmap?

> > > > Ick. Please don't do this. Can we please keep an mm as just an mm
> > > > and not make it look magically different depending on which process
> > > > maps it? If you need a trampoline (which you do, of course), just
> > > > write a trampoline in regular user code and map it manually.
> > >
> > > Did I understand you correctly that you are proposing that the
> > > switching thread should make sure by itself that its code, stack, …
> > > memory regions are properly set up in the new AS before/after
> > > switching into it? I think this would make using first class virtual
> > > address spaces much more difficult for user applications, to the
> > > extent that I am not even sure if they can be used at all. At the
> > > moment, switching into a VAS is a very simple operation for an
> > > application because the kernel will just simply do the right thing.
> >
> > Yes. I think that having the same mm_struct look different from
> > different tasks is problematic. Getting it right in the arch code is
> > going to be nasty. The heuristics of what to share are also tough --
> > why would text + data + stack or whatever you're doing be adequate?
> > What if you're in a thread? What if two tasks have their stacks in the
> > same place?
>
> The different ASes that a task now can have when it uses first class
> virtual address spaces are not realized in the kernel by using only one
> mm_struct per task that just looks different, but by using multiple
> mm_structs - one for each AS that the task can execute in. When a task
> attaches a first class virtual address space to itself to be able to use
> another AS, the kernel adds a temporary mm_struct to this task that
> contains the mappings of the first class virtual address space and the
> ones shared with the task's original AS. If a thread now wants to switch
> into this attached first class virtual address space, the kernel only
> changes the 'mm' and 'active_mm' pointers in the task_struct of the
> thread to the temporary mm_struct and performs the corresponding
> mm_switch operation. The original mm_struct of the thread will not be
> changed.
>
> Accordingly, I do not magically make mm_structs look different depending
> on the task that uses it, but create temporary mm_structs that only
> contain mappings to the same memory regions.

This sounds complicated and fragile. What happens if a heuristically shared
region coincides with a region in the "first class address space" being
selected?

I think the right solution is "you're a user program playing virtual
address games -- make sure you do it right".

--Andy
Re: [RFC PATCH 00/13] Introduce first class virtual address spaces
On Tue, 14 Mar 2017, Chris Metcalf wrote:
> On 3/14/2017 12:12 PM, Till Smejkal wrote:
> > On Mon, 13 Mar 2017, Andy Lutomirski wrote:
> > > On Mon, Mar 13, 2017 at 7:07 PM, Till Smejkal wrote:
> > > > On Mon, 13 Mar 2017, Andy Lutomirski wrote:
> > > > > This sounds rather complicated. Getting TLB flushing right seems
> > > > > tricky. Why not just map the same thing into multiple mms?
> > > >
> > > > This is exactly what happens at the end. The memory region that is
> > > > described by the VAS segment will be mapped in the ASes that use
> > > > the segment.
> > >
> > > So why is this kernel feature better than just doing MAP_SHARED
> > > manually in userspace?
> >
> > One advantage of VAS segments is that they can be globally queried by
> > user programs, which means that VAS segments can be shared by
> > applications that do not necessarily have to be related. If I am not
> > mistaken, MAP_SHARED of pure in-memory data will only work if the tasks
> > that share the memory region are related (aka. have a common parent
> > that initialized the shared mapping). Otherwise, the shared mapping has
> > to be backed by a file.
>
> True, but why is this bad? The shared mapping will be memory resident
> regardless, even if backed by a file (unless swapped out under heavy
> memory pressure, but arguably that's a feature anyway). More importantly,
> having a file name is a simple and consistent way of identifying such
> shared memory segments.
>
> With a little work, you can also arrange to map such files into memory
> at a fixed address in all participating processes, thus making internal
> pointers work correctly.

I don't want to say that the interface provided by MAP_SHARED is bad. I am
only arguing that VAS segments and the interface that they provide have an
advantage over the existing ones in my opinion. However, Matthew Wilcox also
suggested in an earlier mail that VAS segments could be exported to user
space via a special purpose filesystem. This would enable users of VAS
segments to also just use some special files to set up the shared memory
regions. But since the VAS segment itself already knows where it has to be
mapped in the virtual address space of the process, establishing the shared
memory region would be very easy for the user.

> > VAS segments on the other side allow sharing of pure in-memory data by
> > arbitrary, not necessarily related tasks without the need of a file.
> > This becomes especially interesting if one combines VAS segments with
> > non-volatile memory since one can keep data structures in the NVM and
> > still be able to share them between multiple tasks.
>
> I am not fully up to speed on NV/pmem stuff, but isn't that exactly what
> the DAX mode is supposed to allow you to do? If so, isn't sharing a
> mapped file on a DAX filesystem on top of pmem equivalent to what
> you're proposing?

If I read the documentation on DAX filesystems correctly, it is indeed
possible to use them to create files that live purely in NVM. I wasn't fully
aware of this feature. Thanks for the pointer.

However, the main contribution of this patchset is actually the idea of
first class virtual address spaces and that they can be used to allow
processes to have multiple different views on the system's main memory. For
us, VAS segments were another logical step in the same direction (from first
class virtual address spaces to first class address space segments).
However, if there is already functionality in the Linux kernel to achieve
the exact same behavior, there is no real need to add VAS segments. I will
continue thinking about them and either find a different situation where the
currently available interface is not sufficient/too complicated or drop VAS
segments from future versions of the patchset.

Till
Re: [RFC PATCH 00/13] Introduce first class virtual address spaces
On 3/14/2017 12:12 PM, Till Smejkal wrote:
> On Mon, 13 Mar 2017, Andy Lutomirski wrote:
> > On Mon, Mar 13, 2017 at 7:07 PM, Till Smejkal wrote:
> > > On Mon, 13 Mar 2017, Andy Lutomirski wrote:
> > > > This sounds rather complicated. Getting TLB flushing right seems
> > > > tricky. Why not just map the same thing into multiple mms?
> > >
> > > This is exactly what happens at the end. The memory region that is
> > > described by the VAS segment will be mapped in the ASes that use the
> > > segment.
> >
> > So why is this kernel feature better than just doing MAP_SHARED
> > manually in userspace?
>
> One advantage of VAS segments is that they can be globally queried by user
> programs, which means that VAS segments can be shared by applications that
> do not necessarily have to be related. If I am not mistaken, MAP_SHARED of
> pure in-memory data will only work if the tasks that share the memory
> region are related (aka. have a common parent that initialized the shared
> mapping). Otherwise, the shared mapping has to be backed by a file.

True, but why is this bad? The shared mapping will be memory resident
regardless, even if backed by a file (unless swapped out under heavy memory
pressure, but arguably that's a feature anyway). More importantly, having a
file name is a simple and consistent way of identifying such shared memory
segments.

With a little work, you can also arrange to map such files into memory at a
fixed address in all participating processes, thus making internal pointers
work correctly.

> VAS segments on the other side allow sharing of pure in-memory data by
> arbitrary, not necessarily related tasks without the need of a file. This
> becomes especially interesting if one combines VAS segments with
> non-volatile memory since one can keep data structures in the NVM and
> still be able to share them between multiple tasks.

I am not fully up to speed on NV/pmem stuff, but isn't that exactly what the
DAX mode is supposed to allow you to do? If so, isn't sharing a mapped file
on a DAX filesystem on top of pmem equivalent to what you're proposing?

--
Chris Metcalf, Mellanox Technologies
http://www.mellanox.com
Re: [RFC PATCH 00/13] Introduce first class virtual address spaces
On Mon, 13 Mar 2017, Andy Lutomirski wrote:
> On Mon, Mar 13, 2017 at 7:07 PM, Till Smejkal wrote:
> > On Mon, 13 Mar 2017, Andy Lutomirski wrote:
> > > This sounds rather complicated. Getting TLB flushing right seems
> > > tricky. Why not just map the same thing into multiple mms?
> >
> > This is exactly what happens at the end. The memory region that is
> > described by the VAS segment will be mapped in the ASes that use the
> > segment.
>
> So why is this kernel feature better than just doing MAP_SHARED
> manually in userspace?

One advantage of VAS segments is that they can be globally queried by user programs, which means that VAS segments can be shared by applications that do not necessarily have to be related. If I am not mistaken, MAP_SHARED of pure in-memory data will only work if the tasks that share the memory region are related (aka. have a common parent that initialized the shared mapping). Otherwise, the shared mapping has to be backed by a file. VAS segments on the other hand allow sharing of pure in-memory data by arbitrary, not necessarily related, tasks without the need of a file. This becomes especially interesting if one combines VAS segments with non-volatile memory, since one can keep data structures in the NVM and still be able to share them between multiple tasks.

> > > Ick. Please don't do this. Can we please keep an mm as just an mm
> > > and not make it look magically different depending on which process
> > > maps it? If you need a trampoline (which you do, of course), just
> > > write a trampoline in regular user code and map it manually.
> >
> > Did I understand you correctly that you are proposing that the switching
> > thread should make sure by itself that its code, stack, … memory regions
> > are properly set up in the new AS before/after switching into it? I think
> > this would make using first class virtual address spaces much more
> > difficult for user applications to the extent that I am not even sure if
> > they can be used at all.
> > At the moment, switching into a VAS is a very simple operation for an
> > application because the kernel will just simply do the right thing.
>
> Yes. I think that having the same mm_struct look different from
> different tasks is problematic. Getting it right in the arch code is
> going to be nasty. The heuristics of what to share are also tough --
> why would text + data + stack or whatever you're doing be adequate?
> What if you're in a thread? What if two tasks have their stacks in
> the same place?

The different ASes that a task can now have when it uses first class virtual address spaces are not realized in the kernel by using only one mm_struct per task that just looks different, but by using multiple mm_structs - one for each AS that the task can execute in. When a task attaches a first class virtual address space to itself to be able to use another AS, the kernel adds a temporary mm_struct to this task that contains the mappings of the first class virtual address space and the ones shared with the task's original AS. If a thread now wants to switch into this attached first class virtual address space, the kernel only changes the 'mm' and 'active_mm' pointers in the task_struct of the thread to the temporary mm_struct and performs the corresponding mm_switch operation. The original mm_struct of the thread will not be changed. Accordingly, I do not magically make mm_structs look different depending on the task that uses them, but create temporary mm_structs that only contain mappings to the same memory regions.

I agree that finding a good heuristic of what to share is difficult. At the moment, all memory regions that are available in the task's original AS will also be available when a thread switches into an attached first class virtual address space (aka. are shared). That means that, in the current state of the implementation, VAS can mainly be used to extend the AS of a task.
The reason why I implemented the sharing in this way is that I didn't want to break shared libraries. If I only share code+heap+stack, shared libraries would not work anymore after switching into a VAS.

> I could imagine that something like a sigaltstack() mode that lets you
> set a signal up to also switch mm could be useful.

This is a very interesting idea. I will keep it in mind for future use cases of multiple virtual address spaces per task.

Thanks Till
Re: [RFC PATCH 00/13] Introduce first class virtual address spaces
On Mon, Mar 13, 2017 at 7:07 PM, Till Smejkal wrote:
> On Mon, 13 Mar 2017, Andy Lutomirski wrote:
> > This sounds rather complicated. Getting TLB flushing right seems
> > tricky. Why not just map the same thing into multiple mms?
>
> This is exactly what happens at the end. The memory region that is
> described by the VAS segment will be mapped in the ASes that use the
> segment.

So why is this kernel feature better than just doing MAP_SHARED manually in userspace?

> > Ick. Please don't do this. Can we please keep an mm as just an mm
> > and not make it look magically different depending on which process
> > maps it? If you need a trampoline (which you do, of course), just
> > write a trampoline in regular user code and map it manually.
>
> Did I understand you correctly that you are proposing that the switching
> thread should make sure by itself that its code, stack, … memory regions
> are properly set up in the new AS before/after switching into it? I think
> this would make using first class virtual address spaces much more
> difficult for user applications to the extent that I am not even sure if
> they can be used at all. At the moment, switching into a VAS is a very
> simple operation for an application because the kernel will just simply
> do the right thing.

Yes. I think that having the same mm_struct look different from different tasks is problematic. Getting it right in the arch code is going to be nasty. The heuristics of what to share are also tough -- why would text + data + stack or whatever you're doing be adequate? What if you're in a thread? What if two tasks have their stacks in the same place?

I could imagine that something like a sigaltstack() mode that lets you set a signal up to also switch mm could be useful.
Re: [RFC PATCH 00/13] Introduce first class virtual address spaces
On Mon, 13 Mar 2017, Andy Lutomirski wrote:
> On Mon, Mar 13, 2017 at 3:14 PM, Till Smejkal wrote:
> > This patchset extends the kernel memory management subsystem with a new
> > type of address spaces (called VAS) which can be created and destroyed
> > independently of processes by a user in the system. During its lifetime
> > such a VAS can be attached to processes by the user which allows a process
> > to have multiple address spaces and thereby multiple, potentially
> > different, views on the system's main memory. During its execution the
> > threads belonging to the process are able to switch freely between the
> > different attached VAS and the process' original AS enabling them to
> > utilize the different available views on the memory.
>
> Sounds like the old SKAS feature for UML.

I haven't heard of this feature before, but after briefly looking at the description on the UML website, it actually has some similarities with what I am proposing. But as far as I can see this was not merged into the mainline kernel, was it? In addition, I think that first class virtual address spaces go even one step further by allowing ASes to live independently of processes.

> > In addition to the concept of first class virtual address spaces, this
> > patchset introduces yet another feature called VAS segments. VAS segments
> > are memory regions which have a fixed size and position in the virtual
> > address space and can be shared between multiple first class virtual
> > address spaces. Such shareable memory regions are especially useful for
> > in-memory pointer-based data structures or other pure in-memory data.
>
> This sounds rather complicated. Getting TLB flushing right seems
> tricky. Why not just map the same thing into multiple mms?

This is exactly what happens at the end. The memory region that is described by the VAS segment will be mapped in the ASes that use the segment.
> >        |   VAS   | processes
> > switch |  468ns  |   1944ns
>
> The solution here is IMO to fix the scheduler.

IMHO it will be very difficult for the scheduler code to reach the same switching time as the pure VAS switch, because switching between VASes does not involve saving any registers or FPU state and does not require selecting the next runnable task. A VAS switch is basically a system call that just changes the AS of the current thread, which makes it a very lightweight operation.

> Also, FWIW, I have patches (that need a little work) that will make
> switch_mm() way faster on x86.

These patches will also improve the speed of the VAS switch operation. We are also using the switch_mm function in the background to perform the actual hardware switch between the two ASes. The main reason why the VAS switch is faster than the task switch is that it just has to do fewer things.

> > At the current state of the development, first class virtual address spaces
> > have one limitation that we haven't been able to solve so far. The feature
> > allows that different threads of the same process can execute in different
> > AS at the same time. This is possible because the VAS-switch operation
> > only changes the active mm_struct for the task_struct of the calling
> > thread. However, when a thread switches into a first class virtual address
> > space, some parts of its original AS are duplicated into the new one to
> > allow the thread to continue its execution at its current state.
>
> Ick. Please don't do this. Can we please keep an mm as just an mm
> and not make it look magically different depending on which process
> maps it? If you need a trampoline (which you do, of course), just
> write a trampoline in regular user code and map it manually.

Did I understand you correctly that you are proposing that the switching thread should make sure by itself that its code, stack, … memory regions are properly set up in the new AS before/after switching into it?
I think this would make using first class virtual address spaces much more difficult for user applications to the extent that I am not even sure if they can be used at all. At the moment, switching into a VAS is a very simple operation for an application because the kernel will just simply do the right thing.

Till
Re: [RFC PATCH 00/13] Introduce first class virtual address spaces
On Tue, 14 Mar 2017, Richard Henderson wrote:
> On 03/14/2017 10:39 AM, Till Smejkal wrote:
> > > Is this an indication that full virtual address spaces are useless? It
> > > would seem like if you only use virtual address segments then you avoid
> > > all of the problems with executing code, active stacks, and brk.
> >
> > What do you mean with *virtual address segments*? The nice part of first
> > class virtual address spaces is that one can share/reuse collections of
> > address space segments easily.
>
> What do *I* mean? You introduced the term, didn't you?
> Rereading your original I see you called them "VAS segments".

Oh, I am sorry. I thought that you were referring to some other feature that I didn't know about.

> Anyway, whatever they are called, it would seem that these segments do not
> require any of the syncing mechanisms that are causing you problems.

Yes, VAS segments provide a possibility to share memory regions between multiple address spaces without the need to synchronize heap, stack, etc. Unfortunately, the VAS segment feature by itself, without the whole concept of first class virtual address spaces, is not as powerful. With some additional work it can probably be represented with the existing shmem functionality. The first class virtual address space feature, on the other hand, provides in our opinion a real benefit for applications, namely that an application can switch between different views on its memory, which enables various interesting programming paradigms as mentioned in the cover letter.

Till
Re: [RFC PATCH 00/13] Introduce first class virtual address spaces
On 03/14/2017 10:39 AM, Till Smejkal wrote:
> > Is this an indication that full virtual address spaces are useless? It
> > would seem like if you only use virtual address segments then you avoid
> > all of the problems with executing code, active stacks, and brk.
>
> What do you mean with *virtual address segments*? The nice part of first
> class virtual address spaces is that one can share/reuse collections of
> address space segments easily.

What do *I* mean? You introduced the term, didn't you? Rereading your original I see you called them "VAS segments".

Anyway, whatever they are called, it would seem that these segments do not require any of the syncing mechanisms that are causing you problems.

r~
Re: [RFC PATCH 00/13] Introduce first class virtual address spaces
On Mon, Mar 13, 2017 at 3:14 PM, Till Smejkal wrote:
> This patchset extends the kernel memory management subsystem with a new
> type of address spaces (called VAS) which can be created and destroyed
> independently of processes by a user in the system. During its lifetime
> such a VAS can be attached to processes by the user which allows a process
> to have multiple address spaces and thereby multiple, potentially
> different, views on the system's main memory. During its execution the
> threads belonging to the process are able to switch freely between the
> different attached VAS and the process' original AS enabling them to
> utilize the different available views on the memory.

Sounds like the old SKAS feature for UML.

> In addition to the concept of first class virtual address spaces, this
> patchset introduces yet another feature called VAS segments. VAS segments
> are memory regions which have a fixed size and position in the virtual
> address space and can be shared between multiple first class virtual
> address spaces. Such shareable memory regions are especially useful for
> in-memory pointer-based data structures or other pure in-memory data.

This sounds rather complicated. Getting TLB flushing right seems tricky. Why not just map the same thing into multiple mms?

>        |   VAS   | processes
> switch |  468ns  |   1944ns

The solution here is IMO to fix the scheduler.

Also, FWIW, I have patches (that need a little work) that will make switch_mm() way faster on x86.

> At the current state of the development, first class virtual address spaces
> have one limitation that we haven't been able to solve so far. The feature
> allows that different threads of the same process can execute in different
> AS at the same time. This is possible because the VAS-switch operation
> only changes the active mm_struct for the task_struct of the calling
> thread.
> However, when a thread switches into a first class virtual address
> space, some parts of its original AS are duplicated into the new one to
> allow the thread to continue its execution at its current state.

Ick. Please don't do this. Can we please keep an mm as just an mm and not make it look magically different depending on which process maps it? If you need a trampoline (which you do, of course), just write a trampoline in regular user code and map it manually.

--Andy
Re: [RFC PATCH 00/13] Introduce first class virtual address spaces
On Tue, 14 Mar 2017, Richard Henderson wrote:
> On 03/14/2017 08:14 AM, Till Smejkal wrote:
> > At the current state of the development, first class virtual address spaces
> > have one limitation that we haven't been able to solve so far. The feature
> > allows that different threads of the same process can execute in different
> > AS at the same time. This is possible because the VAS-switch operation
> > only changes the active mm_struct for the task_struct of the calling
> > thread. However, when a thread switches into a first class virtual address
> > space, some parts of its original AS are duplicated into the new one to
> > allow the thread to continue its execution at its current state.
> > Accordingly, parts of the process's AS (e.g. the code section, data
> > section, heap section and stack sections) exist in multiple AS if the
> > process has a VAS attached to it. Changes to these shared memory regions
> > are synchronized between the address spaces whenever a thread switches
> > between two of them. Unfortunately, in some scenarios the kernel is not
> > able to properly synchronize all these shared memory regions because of
> > conflicting changes. One such example happens if there are two threads,
> > one executing in an attached first class virtual address space, the other
> > in the task's original address space. If both threads make changes to the
> > heap section that cause expansion of the underlying vm_area_struct, the
> > kernel cannot correctly synchronize these changes, because that would
> > cause parts of the virtual address space to be overwritten with unrelated
> > data. In the current implementation such conflicts are only detected but
> > not resolved and result in an error code being returned by the kernel
> > during the VAS switch operation. Unfortunately, that means for the
> > particular thread that tried to make the switch that it cannot do so
> > anymore in the future and accordingly has to be killed.
> This sounds like a fairly fundamental problem to me.

Yes, I agree. This is a significant limitation of first class virtual address spaces. However, conflicts like this can be mitigated by being careful in the application that uses multiple first class virtual address spaces. If all threads make sure that they never resize shared memory regions while executing inside a VAS, such conflicts do not occur. Another possibility that I investigated but have not yet finished is to synchronize such resizes of shared memory regions more frequently than just at every switch between VASes. If one, for example, "forwards" memory region resizes to all ASes that share this particular memory region during the resize operation, one can completely eliminate this problem. Unfortunately, this introduces significant cost and a difficult-to-handle race condition.

> Is this an indication that full virtual address spaces are useless? It
> would seem like if you only use virtual address segments then you avoid
> all of the problems with executing code, active stacks, and brk.

What do you mean with *virtual address segments*? The nice part of first class virtual address spaces is that one can share/reuse collections of address space segments easily.

Till
Re: [RFC PATCH 00/13] Introduce first class virtual address spaces
On 03/14/2017 08:14 AM, Till Smejkal wrote:
> At the current state of the development, first class virtual address
> spaces have one limitation that we haven't been able to solve so far.
> The feature allows that different threads of the same process can
> execute in different AS at the same time. This is possible because the
> VAS-switch operation only changes the active mm_struct for the
> task_struct of the calling thread. However, when a thread switches into
> a first class virtual address space, some parts of its original AS are
> duplicated into the new one to allow the thread to continue its
> execution at its current state. Accordingly, parts of the process's AS
> (e.g. the code section, data section, heap section and stack sections)
> exist in multiple AS if the process has a VAS attached to it. Changes to
> these shared memory regions are synchronized between the address spaces
> whenever a thread switches between two of them. Unfortunately, in some
> scenarios the kernel is not able to properly synchronize all these
> shared memory regions because of conflicting changes. One such example
> happens if there are two threads, one executing in an attached first
> class virtual address space, the other in the task's original address
> space. If both threads make changes to the heap section that cause
> expansion of the underlying vm_area_struct, the kernel cannot correctly
> synchronize these changes, because that would cause parts of the virtual
> address space to be overwritten with unrelated data. In the current
> implementation such conflicts are only detected but not resolved and
> result in an error code being returned by the kernel during the VAS
> switch operation. Unfortunately, that means for the particular thread
> that tried to make the switch that it cannot do so anymore in the future
> and accordingly has to be killed.

This sounds like a fairly fundamental problem to me.

Is this an indication that full virtual address spaces are useless? It would seem like if you only use virtual address segments then you avoid all of the problems with executing code, active stacks, and brk.

r~
[RFC PATCH 00/13] Introduce first class virtual address spaces
First class virtual address spaces (also called VAS) are a new functionality of the Linux kernel allowing address spaces to exist independently of processes. The general idea behind this feature is described in a paper published at ASPLOS'16, titled 'SpaceJMP: Programming with Multiple Virtual Address Spaces' [1].

This patchset extends the kernel memory management subsystem with a new type of address spaces (called VAS) which can be created and destroyed independently of processes by a user in the system. During its lifetime such a VAS can be attached to processes by the user, which allows a process to have multiple address spaces and thereby multiple, potentially different, views on the system's main memory. During its execution the threads belonging to the process are able to switch freely between the different attached VAS and the process' original AS, enabling them to utilize the different available views on the memory. These multiple virtual address spaces per process and the possibility to switch between them freely can be used in multiple interesting ways, as also outlined in the mentioned paper. Some of the many possible applications are, for example, compartmentalizing a process for security reasons, improving the performance of data-centric applications, and introducing new application models [1].

In addition to the concept of first class virtual address spaces, this patchset introduces yet another feature called VAS segments. VAS segments are memory regions which have a fixed size and position in the virtual address space and can be shared between multiple first class virtual address spaces. Such shareable memory regions are especially useful for in-memory pointer-based data structures or other pure in-memory data.
First class virtual address spaces have a significant advantage compared to forking a process and using inter-process communication mechanisms, namely that creating and switching between VASes is significantly faster than creating and switching between processes. As can be seen in the following table, measured on an Intel Xeon E5620 CPU at 2.40GHz, creating a VAS is about 7 times faster than forking, and switching between VASes is up to 4 times faster than switching between processes.

         |   VAS    | processes
  -------+----------+-----------
  switch |    468ns |    1944ns
  create |  20003ns |  150491ns

Hence, first class virtual address spaces provide a fast mechanism for applications to utilize multiple virtual address spaces in parallel with higher performance than splitting up the application into multiple independent processes.

Both VAS and VAS segments have another significant advantage when combined with non-volatile memory. Because their life cycle is independent of processes and other kernel data structures, they can be used to save special memory regions or even whole ASes into non-volatile memory, making it possible to reuse them across multiple system reboots.

At the current state of the development, first class virtual address spaces have one limitation that we haven't been able to solve so far. The feature allows that different threads of the same process can execute in different AS at the same time. This is possible because the VAS-switch operation only changes the active mm_struct for the task_struct of the calling thread. However, when a thread switches into a first class virtual address space, some parts of its original AS are duplicated into the new one to allow the thread to continue its execution at its current state. Accordingly, parts of the process's AS (e.g. the code section, data section, heap section and stack sections) exist in multiple AS if the process has a VAS attached to it.
Changes to these shared memory regions are synchronized between the address spaces whenever a thread switches between two of them. Unfortunately, in some scenarios the kernel is not able to properly synchronize all these shared memory regions because of conflicting changes. One such example happens if there are two threads, one executing in an attached first class virtual address space, the other in the task's original address space. If both threads make changes to the heap section that cause expansion of the underlying vm_area_struct, the kernel cannot correctly synchronize these changes, because that would cause parts of the virtual address space to be overwritten with unrelated data. In the current implementation such conflicts are only detected but not resolved and result in an error code being returned by the kernel during the VAS switch operation. Unfortunately, that means for the particular thread that tried to make the switch that it cannot do so anymore in the future and accordingly has to be killed.

This code was developed during an internship at Hewlett Packard Enterprise.

[1] http://impact.crhc.illinois.edu/shared/Papers/ASPLOS16-SpaceJMP.pdf

Till Smejkal (13): mm: Add mm_struct argument to 'mmap_region' mm: Add m