Re: [PATCH 0/3] Volatile Ranges (v7) & Lots of words

2012-10-02 Thread Taras Glek

On 9/28/2012 8:16 PM, John Stultz wrote:


There are two rough approaches that I have tried so far:

1) Managing volatile range objects, in a tree or list, which are then
purged using a shrinker

2) Page based management, where pages marked volatile are moved to
a new LRU list and are purged from there.



1) This patchset takes the shrinker-based approach. In many ways it
is simpler, but it does have a few drawbacks.  Basically when marking a
range as volatile, we create a range object, and add it to an rbtree.
This allows us to be able to quickly find ranges, given an address in
the file.  We also add each range object to the tail of a  filesystem
global linked list, which acts as an LRU allowing us to quickly find
the least recently created volatile range. We then use a shrinker
callback to trigger purging, where we'll select the range on the head
of the LRU list, purge the data, mark the range object as purged,
and remove it from the lru list.

This allows fairly efficient behavior, as marking and unmarking
a range are both O(log n) operations with respect to the number of
ranges, to insert and remove from the tree.  Purging the range is
also O(1) to select the range, and we purge the entire range in
least-recently-marked-volatile order.
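For illustration, the bookkeeping described above can be modeled in
userspace (a Python sketch only: a sorted list stands in for the rbtree,
an insertion-ordered dict for the LRU-ordered linked list; none of these
names come from the actual patchset):

```python
import bisect
from collections import OrderedDict

class VolatileRanges:
    """Toy model of approach 1: range objects in a tree plus a global LRU."""
    def __init__(self):
        self.starts = []          # sorted range starts (stand-in for the rbtree)
        self.ranges = {}          # start -> [end, purged]
        self.lru = OrderedDict()  # insertion order == least-recently-marked first

    def mark(self, start, end):
        """Mark a range volatile: tree insert plus append to the LRU tail."""
        bisect.insort(self.starts, start)
        self.ranges[start] = [end, False]
        self.lru[start] = True

    def unmark(self, start):
        """Unmark: remove from tree and LRU; returns [end, purged]."""
        self.starts.remove(start)
        self.lru.pop(start, None)
        return self.ranges.pop(start)

    def shrink(self):
        """Shrinker callback: O(1) pick of the least-recently-marked range."""
        if not self.lru:
            return None
        start, _ = self.lru.popitem(last=False)
        self.ranges[start][1] = True  # purged; stays findable in the tree
        return start
```

Note the asymmetry the mail describes: marking and unmarking touch only
the one range object, while the shrinker purges a whole range in
least-recently-marked order.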

The drawback of this approach is that it uses a shrinker, and is thus
NUMA-unaware. We track the virtual address of the pages in the file,
so we don't have a sense of which physical pages we're using, nor
which node those pages may be on. So it's possible on a multi-node
system that when one node was under pressure, we'd purge volatile
ranges that are all on a different node, in effect throwing data away
without helping anything. This is clearly non-ideal for NUMA systems.

One idea I discussed with Michel Lespinasse is that this might be
something we could improve by providing the shrinker some node context,
then keeping track in each range of which node its first page is on.
That way we would be sure to free up at least one page on the node
under pressure when purging that range.


2) The second approach, which is page based, was also tried. In this
case, when we marked a range as volatile, the pages in that range were
moved to a new LRU list, LRU_VOLATILE, in vmscan.c.  This provided
a page LRU list that could be used to free pages before looking at
the LRU_INACTIVE_FILE/ANONYMOUS lists.

This integrates the feature deeper in the mm code, which is nice,
especially as we have an LRU_VOLATILE list for each NUMA node. Thus
under pressure we won't purge ranges that are entirely on a different
node, as is possible with the other approach.

However, this approach is more costly.  When marking a range
as volatile, we have to migrate every page in that range to the
LRU_VOLATILE list, and similarly on unmarking we have to move each
page back. This ends up being O(n) with respect to the number of
pages in the range being marked or unmarked. Similarly, when purging,
we let the scanning code select a page off the LRU, then we have to
map it back to its volatile range so we can purge the entire range,
making purging a more expensive O(log n) operation with respect to
the number of ranges.
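A comparable toy model of approach 2 (again an illustrative Python
sketch, not kernel code) shows where the O(n) marking cost and the
purge-time reverse lookup come from: per-node page LRUs plus a
page-to-range map, with all names invented:

```python
from collections import deque

class PageVolatileLRU:
    """Toy model of approach 2: an LRU_VOLATILE page list per NUMA node."""
    def __init__(self, nodes=2):
        self.volatile = [deque() for _ in range(nodes)]  # per-node page LRUs
        self.page_to_range = {}                          # page -> owning range id

    def mark(self, range_id, pages, node):
        # O(n) in pages: every page in the range migrates onto the list
        for p in pages:
            self.volatile[node].append(p)
            self.page_to_range[p] = range_id

    def purge_one(self, node):
        # Reclaim picks one page off this node's LRU, then must map it
        # back to its range so the whole range is purged together.
        if not self.volatile[node]:
            return None
        page = self.volatile[node].popleft()
        rid = self.page_to_range[page]
        for q in [p for p, r in self.page_to_range.items() if r == rid]:
            del self.page_to_range[q]
            try:
                self.volatile[node].remove(q)
            except ValueError:
                pass  # the page we already popped off the LRU
        return rid
```

Unlike approach 1, pressure on a given node can only ever purge ranges
that actually have pages on that node's list.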

This is a particular concern as applications that want to mark and
unmark data as volatile with fine granularity will likely be calling
these operations frequently, adding quite a bit of overhead. This
makes it less likely that applications will choose to volunteer data
as volatile to the system.

With the new lazy SIGBUS notification, applications using
the SIGBUS method would avoid having to mark and unmark data when
accessing it, so this overhead may be less of a concern. However, for
cases where applications don't want to deal with SIGBUS and would
rather have the more deterministic behavior of the unmark/access/mark
pattern, the performance is a concern.
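The deterministic access style contrasted above can be sketched as
follows (illustrative Python; VolatileItem and its methods are invented
stand-ins for whatever userspace wrapper an application would use):

```python
class VolatileItem:
    """Toy stand-in for a range of cache memory; all names are invented."""
    def __init__(self, data):
        self.data = data
        self.volatile = False
        self.purged = False   # set by the "kernel" under memory pressure

    def mark(self):           # volunteer the pages as discardable
        self.volatile = True

    def unmark(self):         # pin the pages; report whether they were purged
        self.volatile = False
        was_purged, self.purged = self.purged, False
        return was_purged

def with_item(item, regenerate):
    """Deterministic unmark/access/mark pattern: two extra calls per access."""
    if item.unmark():             # data was discarded while volatile
        item.data = regenerate()  # rebuild it before use
    try:
        return item.data
    finally:
        item.mark()               # make it discardable again
```

The SIGBUS style skips the unmark/mark pair on every access, at the
cost of handling the signal when a purged page is touched.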
Unfortunately, approach 1 is not useful for our use case. It would mean
that we are continuously re-decompressing frequently used parts of
libxul.so under memory pressure (which is pretty often on limited-RAM
devices).



Taras

ps. John, I really appreciate movement on this. We really need this to
improve Firefox memory usage and startup speed on low-memory devices. It
will be great to have Firefox start faster and respond to memory pressure
better on desktop Linux too.



--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC 1/8] Introduce new system call mvolatile

2013-01-03 Thread Taras Glek

On 1/2/2013 8:27 PM, Minchan Kim wrote:

This patch adds the new system calls m[no]volatile.
If someone asks for an is_volatile system call, it could be added, too.

The reason I introduced new system calls instead of madvise is that
m[no]volatile's vma handling is totally different from madvise's vma
handling.

1) m[no]volatile should succeed even if the range includes unmapped
or non-volatile areas. It just skips such areas, without stopping and
returning an error when it encounters an invalid range. This is more
convenient for the user, who avoids making several system calls over
small ranges - suggested by John Stultz.

2) The purged state of a volatile range should be propagated out to the
user even if the range was merged with an adjacent non-volatile range
when the user calls mnovolatile.

3) mvolatile's interface could diverge further from madvise in
future discussion.  For example, I feel we may need
mvolatile(start, len, mode), where 'mode' is FULL_VOLATILE or
PARTIAL_VOLATILE.
FULL_VOLATILE means that if the VM decides to reclaim the range, it
reclaims all of the pages in the range, while with PARTIAL_VOLATILE
the VM may reclaim just a few of the pages in the range.
In the tmpfs-volatile case, the user may regenerate all of the image
data once any page in the range is discarded, so it is pointless for
the VM to discard only a single page in the range when memory pressure
is severe.
In the anon-volatile case, discarding too much causes too many minor
faults for the allocator, so it would be better to discard only part
of the range.

I don't understand point 3).
Are you saying that using mvolatile in conjunction with madvise could
allow mvolatile behavior to be tweaked in the future? Or are you
suggesting adding an extra parameter in the future (what would that have
to do with madvise)?


4) Having a new system call makes it easier for userspace apps to detect 
kernels without this functionality.


I really like the proposed interface. I like the suggestion of having
an explicit FULL|PARTIAL_VOLATILE. Why not include PARTIAL_VOLATILE as a
required 3rd param in the first version, with the expectation that
FULL_VOLATILE will be added later (returning some not-supported error in
the meantime)?


3) The mvolatile system call's return value is quite different from
madvise's. Look at the semantic explanation below.

So I want to separate mvolatile from madvise.

mvolatile(start, len)'s semantics

1) It makes the range (start, len) volatile even if the range includes
unmapped areas, special mappings, and mlocked areas, which are just skipped.

Return -EINVAL if the range doesn't include a right vma at all.
Return -ENOMEM, interrupting the range operation, if there is not
enough memory to merge/split vmas. In this case, some ranges would be
volatile and others not, so the user may call mvolatile again after
cancelling the whole range with mnovolatile.
Return 0 if the range consists of only proper vmas.
Return 1 if part of the range includes a hole/huge/ksm/mlock/special area.
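As a cross-check of the return-value rules above, here is a small model
in Python (the (start, end, kind) vma tuples and the kind names are
invented for illustration; 22 is the usual EINVAL value, and gaps
between vmas stand in for unmapped holes):

```python
EINVAL = 22

def mvolatile_result(start, length, vmas):
    """Model of the proposed return value.  vmas: (start, end, kind)
    tuples; kind is 'normal' or one of the skipped kinds
    ('huge', 'ksm', 'mlock', 'special').  Gaps between vmas are holes."""
    end = start + length
    overlapping = [v for v in vmas if v[0] < end and v[1] > start]
    proper = [v for v in overlapping if v[2] == 'normal']
    if not proper:
        return -EINVAL   # range doesn't include a right vma at all
    covered = sum(min(v[1], end) - max(v[0], start) for v in proper)
    if covered < length:
        return 1         # part of the range is a hole or a skipped area
    return 0             # range consists of only proper vmas
```

The -ENOMEM case (a failed vma merge/split partway through) is an
allocation failure and is not modeled here.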

2) If the user calls mvolatile on a range that is already a volatile VMA,
even one in the purged state, the VOLATILE attribute remains but the purged
state is reset. I expect some users will want to split a volatile vma into
smaller ranges. Although they could do that with mnovolatile(whole range)
and several calls of mvolatile(smaller range), this behavior avoids the
mnovolatile call if they don't care about the purged state. I'm not sure we
really need this, so I hope to hear opinions. Unfortunately, the current
implementation doesn't split the volatile VMA with the new range in this
case. I forgot to implement it in this version but decided to send the
patch anyway to gather opinions, because implementing it is rather trivial
if we decide to.

mnovolatile(start, len)'s semantics are as follows.

1) It makes the range (start, len) non-volatile even if the range
includes unmapped areas, special mappings, and non-volatile ranges,
which are just skipped.

2) If the range was purged, it will return 1 regardless of whether it
includes invalid ranges.

If I understand this correctly:
mvolatile(0, 10);
// then range [9,10] is purged by the kernel
mnovolatile(0, 4) will return 1 (purged)?
That seems counterintuitive.

One of the uses for mnovolatile is to atomically lock the pages (vs. the
racy proposed is_volatile syscall). The above situation would make it
less effective.





3) It returns -ENOMEM if the system doesn't have enough memory for the vma operation.

4) It returns -EINVAL if the range doesn't include a right vma at all.

5) If the user tries to access a purged range without an mnovolatile call, it
encounters SIGBUS, which will show up in the next patch.

Cc: Michael Kerrisk 
Cc: Arun Sharma 
Cc: san...@google.com
Cc: Paul Turner 
CC: David Rientjes 
Cc: John Stultz 
Cc: Andrew Morton 
Cc: Christoph Lameter 
Cc: Android Kernel Team 
Cc: Robert Love 
Cc: Mel Gorman 
Cc: Hugh Dickins 
Cc: Dave Hansen 
Cc: Rik van Riel 
Cc: Dave Chinner 
Cc: Neil Brown 
Cc: Mike Hommey 
Cc: Taras Glek 
Cc: KOSAKI Motohiro 
Cc: KAMEZAWA Hiroyuki 
Signed-off-by: Mi

Re: [RFC/PATCH 0/2] ext4: Transparent Decompression Support

2013-07-25 Thread Taras Glek



Dhaval Giani wrote:

On 07/24/2013 07:36 PM, Jörn Engel wrote:

On Wed, 24 July 2013 17:03:53 -0400, Dhaval Giani wrote:

I am posting this series early in its development phase to solicit some
feedback.

At this state, a good description of the format would be nice.


Sure. The format is quite simple. There is a 20-byte header followed
by an offset table giving us the offsets of 16k compressed zlib chunks.
(16k is the default chunk size; it can be changed with the szip tool,
and the kernel should still decompress the file, as that data is in the
header.) I am not tied to the format; I used it because that is what is
being used here. My final goal is to have the filesystem agnostic of
the compression format, as long as it is seekable.
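The described layout (header, chunk-offset table, zlib chunks) is easy
to prototype. The sketch below captures the spirit of such a seekable
format, but the header fields, magic, and table encoding are guesses,
not the actual szip layout, which this thread doesn't specify:

```python
import struct
import zlib

CHUNK = 16 * 1024
HDR = struct.Struct('<4sIIII')  # magic, uncompressed size, chunk size, nchunks, flags

def compress_seekable(data, chunk=CHUNK):
    """Pack data as header + offset table + independently compressed chunks."""
    chunks = [zlib.compress(data[i:i + chunk]) for i in range(0, len(data), chunk)]
    table, off = [], HDR.size + 4 * (len(chunks) + 1)  # data follows header + table
    for c in chunks:
        table.append(off)
        off += len(c)
    table.append(off)  # end offset, so chunk i spans table[i]:table[i+1]
    hdr = HDR.pack(b'SZIP', len(data), chunk, len(chunks), 0)
    return hdr + b''.join(struct.pack('<I', o) for o in table) + b''.join(chunks)

def read_at(blob, pos, length):
    """Decompress only the chunks covering [pos, pos + length)."""
    _magic, _usize, csize, nchunks, _flags = HDR.unpack_from(blob, 0)
    table = struct.unpack_from('<%dI' % (nchunks + 1), blob, HDR.size)
    first = pos // csize
    last = min(nchunks, -(-(pos + length) // csize))  # ceiling division
    out = b''.join(zlib.decompress(blob[table[i]:table[i + 1]])
                   for i in range(first, last))
    return out[pos - first * csize : pos - first * csize + length]
```

Random reads only decompress the chunks that cover the requested span,
which is what makes the format seekable.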





We are implementing transparent decompression with a focus on ext4. One
of the main usecases is that of Firefox on Android. Currently libxul.so
is compressed and it is loaded into memory by a custom linker on
demand. With the use of transparent decompression, we can make do
without the custom linker. More details (i.e. code) about the linker
can be found at https://github.com/glandium/faulty.lib

It is not quite clear what you want to achieve here.


To introduce transparent decompression. Let someone else do the
compression for us, and supply decompressed data on demand (in this
case a read call). This reduces the complexity which would otherwise
have to be brought into the filesystem.
The main use of file compression for Firefox (it's useful on the Linux
desktop too) is to improve IO throughput and reduce startup latency. In
order for compression to be a net win, an application should be aware of
what is being compressed and what isn't. For example, IO patterns on
large libraries (e.g. the 30MB libxul.so) are well suited to
compression, but SQLite databases are not.  Similarly for our disk
cache: images should not be compressed, but JavaScript should be.
Footprint wins are useful on Android, but it's the increased IO
throughput on crappy storage devices that makes this most attractive.


In addition to being aware of which files should be compressed, Firefox
is aware of the usage patterns of various files, so it could schedule
compression at the most optimal time.


The above needs tie in nicely with the simplification of not
implementing compression at the fs level.



   One approach is
to create an empty file, chattr it to enable compression, then write
uncompressed data to it.  Nothing in userspace will ever know the file
is compressed, unless you explicitly call lsattr.

If you want to follow some other approach where userspace has one
interface to write the compressed data to a file and some other
interface to read the file uncompressed, you are likely in a world of
pain.
Why? If only a few applications know the file is compressed, and read
it to get decompressed data, why would it be painful? What about
introducing a new flag, O_COMPR, which tells the kernel we want this
file decompressed if it can be? It can fall back to O_RDONLY or
something like that. That gets rid of the chattr ugliness.
This transparent decompression idea is based on our experience with
HFS+. Apple uses the fs-attribute approach. OS X is able to compress
application libraries at installation time; apps remain blissfully
unaware but get an extra boost in startup perf.


So in Linux, the package manager could compress .so files, textual data 
files, etc.



Assuming you use the chattr approach, that pretty much comes down to
adding compression support to ext4.  There have been old patches for
ext2 around that never got merged.  Reading up on the problems
encountered by those patches might be instructive.


Do you have subjects for these? When I googled for ext4 compression, I 
found http://code.google.com/p/e4z/ which doesn't seem to exist, and 
checking in my LKML archives gives too many false positives.


Thanks!
Dhaval
