Re: [PATCH 0/3] Volatile Ranges (v7) & Lots of words
On 9/28/2012 8:16 PM, John Stultz wrote:
> There are two rough approaches that I have tried so far:
> 1) Managing volatile range objects, in a tree or list, which are then purged using a shrinker
> 2) Page based management, where pages marked volatile are moved to a new LRU list and are purged from there.
>
> 1) This patchset uses the shrinker-based approach. In many ways it is simpler, but it does have a few drawbacks. Basically when marking a range as volatile, we create a range object and add it to an rbtree. This allows us to quickly find ranges, given an address in the file. We also add each range object to the tail of a filesystem-global linked list, which acts as an LRU allowing us to quickly find the least recently created volatile range. We then use a shrinker callback to trigger purging, where we'll select the range at the head of the LRU list, purge the data, mark the range object as purged, and remove it from the LRU list.
>
> This allows fairly efficient behavior, as marking and unmarking a range are both O(log n) operations with respect to the number of ranges, to insert and remove from the tree. Purging is O(1) to select the range, and we purge entire ranges in least-recently-marked-volatile order.
>
> The drawback with this approach is that it uses a shrinker, and is thus NUMA-unaware. We track the virtual addresses of the pages in the file, so we don't have a sense of which physical pages we're using, nor which node those pages may be on. So it's possible on a multi-node system that when one node is under pressure, we'd purge volatile ranges that are all on a different node, in effect throwing data away without helping anything. This is clearly non-ideal for NUMA systems. One idea I discussed with Michel Lespinasse is that we might improve this by providing the shrinker some node context, then keeping track in each range of which node its first page is on. That way we would be sure to free at least one page on the node under pressure when purging that range.
>
> 2) The second approach, which is page based, was also tried. In this case, when we mark a range as volatile, the pages in that range are moved to a new LRU list, LRU_VOLATILE, in vmscan.c. This provides a page LRU list that can be used to free pages before looking at the LRU_INACTIVE_FILE/ANONYMOUS lists. This integrates the feature deeper into the mm code, which is nice, especially as we have an LRU_VOLATILE list for each NUMA node. Thus under pressure we won't purge ranges that are entirely on a different node, as is possible with the other approach.
>
> However, this approach is more costly. When marking a range as volatile, we have to migrate every page in that range to the LRU_VOLATILE list, and similarly on unmarking we have to move each page back. This ends up being O(n) with respect to the number of pages in the range being marked or unmarked. Similarly, when purging, we let the scanning code select a page off the LRU, then we have to map it back to its volatile range so we can purge the entire range, making it a more expensive O(log n) operation with respect to the number of ranges.
>
> This is a particular concern, as applications that want to mark and unmark data as volatile with fine granularity will likely call these operations frequently, adding quite a bit of overhead. This makes it less likely that applications will volunteer data as volatile to the system. However, with the new lazy SIGBUS notification, applications using the SIGBUS method would avoid having to mark and unmark data when accessing it, so this overhead may be less of a concern. For cases where applications don't want to deal with SIGBUS and would rather have the more deterministic behavior of the unmark/access/mark pattern, the performance is still a concern.

Unfortunately, approach 1 is not useful for our use-case.
Approach 1 will mean that we are continuously re-decompressing frequently used parts of libxul.so under memory pressure (which happens pretty often on limited-RAM devices).

Taras

ps. John, I really appreciate movement on this. We really need this to improve Firefox memory usage and startup speed on low-memory devices. It will be great to have Firefox start faster and respond to memory pressure better on desktop Linux too.

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC 1/8] Introduce new system call mvolatile
On 1/2/2013 8:27 PM, Minchan Kim wrote:
> This patch adds the new system call m[no]volatile. If someone asks for an is_volatile system call, it could be added, too.
>
> The reason why I introduced a new system call instead of madvise is that m[no]volatile's vma handling is totally different from madvise's vma handling.
>
> 1) m[no]volatile should succeed even if the range includes unmapped or non-volatile ranges. It just skips such ranges without stopping or returning an error when it encounters an invalid range. This is convenient for the user, who avoids making several system calls over small ranges - Suggested by John Stultz
>
> 2) The purged state of a volatile range should be propagated out to the user even if the range is merged with an adjacent non-volatile range when the user calls mnovolatile.
>
> 3) mvolatile's interface could be changed with madvise in future discussion. For example, I feel the need for mvolatile(start, len, mode), where 'mode' is FULL_VOLATILE or PARTIAL_VOLATILE. FULL_VOLATILE means that if the VM decides to reclaim the range, it reclaims all of the pages in the range, whereas with PARTIAL_VOLATILE the VM could reclaim just a few pages in the range. In the case of tmpfs-volatile, the user may have to regenerate all the image data once any one page in the range is discarded, so it is pointless for the VM to discard a single page in the range when memory pressure is severe. In the case of anon-volatile, discarding too much causes too many minor faults for the allocator, so it would be better to discard only part of the range.

I don't understand point 3). Are you saying that using mvolatile in conjunction with madvise could allow mvolatile behavior to be tweaked in the future? Or are you suggesting adding an extra parameter in the future (what would that have to do with madvise)?

4) Having a new system call makes it easier for userspace apps to detect kernels without this functionality.

I really like the proposed interface. I like the suggestion of having explicit FULL|PARTIAL_VOLATILE.
Why not include PARTIAL_VOLATILE as a required 3rd param in the first version, with the expectation that FULL_VOLATILE will be added later (returning a not-supported error in the meantime)?

> 3) The mvolatile system call's return value is quite different from madvise's. Look at the semantic explanation below. So I want to separate mvolatile from madvise.
>
> mvolatile(start, len)'s semantics:
>
> 1) It marks the range (start, len) as volatile even if the range includes unmapped areas, special mappings, or mlocked areas, which are just skipped.
>    Return -EINVAL if the range doesn't include a single valid vma.
>    Return -ENOMEM, interrupting the range operation, if there is not enough memory to merge/split vmas. In this case some ranges would be volatile and others not, so the user may call mvolatile again after cancelling the whole range with mnovolatile.
>    Return 0 if the range consists only of proper vmas.
>    Return 1 if part of the range includes a hole/huge/ksm/mlocked/special area.
>
> 2) If the user calls mvolatile on a range that is already a volatile VMA, even one in the purged state, the VOLATILE attribute remains but the purged state is reset. I expect some users will want to split a volatile vma into smaller ranges. Although this can be done with mnovolatile(whole range) followed by several calls to mvolatile(smaller range), this behavior avoids the mnovolatile if the caller doesn't care about the purged state. I'm not sure we really need this, so I hope to hear opinions. Unfortunately, the current implementation doesn't split the volatile VMA for the new range in this case. I forgot to implement it in this version, but decided to send the series anyway to gather opinions, since implementing it is rather trivial if we decide to.
>
> mnovolatile(start, len)'s semantics are as follows:
>
> 1) It marks the range (start, len) as non-volatile even if the range includes unmapped areas, special mappings, or non-volatile ranges, which are just skipped.
>
> 2) If the range was purged, it will return 1 regardless of whether it includes an invalid range.
If I understand this correctly:

    mvolatile(0, 10);
    /* then range [9,10] is purged by the kernel */
    mnovolatile(0, 4);

the mnovolatile(0, 4) will fail? That seems counterintuitive. One of the uses for mnovolatile is to atomically lock the pages (vs. a racy proposed is_volatile syscall). The above situation would make it less effective.

> 3) It returns -ENOMEM if the system doesn't have enough memory for the vma operation.
>
> 4) It returns -EINVAL if the range doesn't include a single valid vma.
>
> 5) If the user tries to access a purged range without an mnovolatile call, it encounters a SIGBUS, which shows up in the next patch.
>
> Cc: Michael Kerrisk
> Cc: Arun Sharma
> Cc: san...@google.com
> Cc: Paul Turner
> CC: David Rientjes
> Cc: John Stultz
> Cc: Andrew Morton
> Cc: Christoph Lameter
> Cc: Android Kernel Team
> Cc: Robert Love
> Cc: Mel Gorman
> Cc: Hugh Dickins
> Cc: Dave Hansen
> Cc: Rik van Riel
> Cc: Dave Chinner
> Cc: Neil Brown
> Cc: Mike Hommey
> Cc: Taras Glek
> Cc: KOSAKI Motohiro
> Cc: KAMEZAWA Hiroyuki
> Signed-off-by: Mi
Re: [RFC/PATCH 0/2] ext4: Transparent Decompression Support
Dhaval Giani wrote:

On 07/24/2013 07:36 PM, Jörn Engel wrote:
> On Wed, 24 July 2013 17:03:53 -0400, Dhaval Giani wrote:
>> I am posting this series early in its development phase to solicit some feedback.
>
> At this state, a good description of the format would be nice.

Sure. The format is quite simple. There is a 20-byte header followed by an offset table giving us the offsets of 16k compressed zlib chunks. (16k is the default chunk size; it can be changed with the szip tool, and the kernel should still decompress the data, since the chunk size is in the header.) I am not tied to this format; I used it because it is what is being used here. My final goal is to have the filesystem agnostic of the compression format, as long as it is seekable.

>> We are implementing transparent decompression with a focus on ext4. One of the main use cases is Firefox on Android. Currently libxul.so is compressed, and it is loaded into memory by a custom linker on demand. With transparent decompression, we can make do without the custom linker. More details (i.e. code) about the linker can be found at https://github.com/glandium/faulty.lib
>
> It is not quite clear what you want to achieve here.

To introduce transparent decompression: let someone else do the compression for us, and supply decompressed data on demand (in this case, on a read call). This reduces the complexity which would otherwise have to be brought into the filesystem.

The main use for file compression for Firefox (it's useful on desktop Linux too) is to improve IO throughput and reduce startup latency. For compression to be a net win, an application should be aware of what is being compressed and what isn't. For example, the IO patterns on large libraries (eg the 30MB libxul.so) are well suited to compression, but SQLite databases are not. Similarly for our disk cache: images should not be compressed, but JavaScript should be.

Footprint wins are useful on Android, but it's the increased IO throughput on crappy storage devices that makes this most attractive. In addition to being aware of which files should be compressed, Firefox is aware of its usage patterns for various files, so it could schedule compression at the most optimal time. The above needs tie in nicely with the simplification of not implementing compression at the fs level.

> One approach is to create an empty file, chattr it to enable compression, then write uncompressed data to it. Nothing in userspace will ever know the file is compressed, unless you explicitly call lsattr. If you want to follow some other approach where userspace has one interface to write the compressed data to a file and some other interface to read the file uncompressed, you are likely in a world of pain.

Why? If only a few applications know the file is compressed, and read it to get decompressed data, why would it be painful? What about introducing a new flag, O_COMPR, which tells the kernel: by the way, we want this file to be decompressed if it can be. It could fall back to O_RDONLY or something like that. That gets rid of the chattr ugliness.

This transparent decompression idea is based on our experience with HFS+. Apple uses the fs-attribute approach. OS X is able to compress application libraries at installation time; apps remain blissfully unaware but get an extra boost in startup perf. So on Linux, the package manager could compress .so files, textual data files, etc.

> Assuming you use the chattr approach, that pretty much comes down to adding compression support to ext4. There have been old patches for ext2 around that never got merged. Reading up on the problems encountered by those patches might be instructive.

Do you have subjects for these? When I googled for ext4 compression, I found http://code.google.com/p/e4z/ which doesn't seem to exist, and checking in my LKML archives gives too many false positives.

Thanks!
Dhaval