Re: [fuse-devel] [PATCH] fuse: make fuse daemon frozen along with kernel threads
On Thu, Feb 07, 2013 at 10:59:19AM +0100, Miklos Szeredi wrote:
> [CC list restored]
>
> On Thu, Feb 7, 2013 at 9:41 AM, Goswin von Brederlow wrote:
> > On Wed, Feb 06, 2013 at 10:27:40AM +0100, Miklos Szeredi wrote:
> >> On Wed, Feb 6, 2013 at 2:11 AM, Li Fei wrote:
> >> >
> >> > There is a well known issue that freezing will fail in case the fuse daemon is frozen first with some requests not handled, as the fuse usage task is waiting for the response from the fuse daemon and can't be frozen.
> >> >
> >> > To solve the issue above, make the fuse daemon freeze after all user space processes are frozen, during the kernel-threads freezing phase. The PF_FREEZE_DAEMON flag is added to indicate that the current thread is the fuse daemon,
> >>
> >> Okay, and how do you know which thread, process or processes belong to the "fuse daemon"?
> >
> > Maybe I'm talking about the wrong thing, but isn't any process having /dev/fuse open "the fuse daemon"? And that doesn't even cover cases where one thread reads requests from the kernel and hands them to worker threads (that do not have /dev/fuse open themselves). Or the fuse request might need mysql to finish a request.
> >
> > I believe figuring out which processes handle fuse requests is a lost proposition.
>
> Pretty much.
>
> > Secondly, how does freezing the daemon second guarantee that it has replied to all pending requests? Or how is leaving it thawed the right decision?
> >
> > Instead the kernel side of fuse should be half frozen and stop sending out new requests. Then it should wait for all pending requests to complete. Then and only then can userspace processes be frozen safely.
>
> The problem with that is that one fuse filesystem might be calling into another. Say two fuse filesystems are mounted at /mnt/A and /mnt/B. Process X starts a read request on /mnt/A. This is handled by process A, which in turn starts a read request on /mnt/B, which is handled by B. If we freeze the system at the moment when A starts handling the request but hasn't yet sent the request to B, then things will be stuck and the freezing will fail.
>
> So the order should be: Freeze the "topmost" fuse filesystems (A in the example) first and, if all pending requests are completed, then freeze the next ones in the hierarchy, etc. This would work if this dependency between filesystems were known. But it's not, and again it looks like an impossible task.

What is topmost? The kernel can't know that for sure.

> The only way to *reliably* freeze fuse filesystems is to let them freeze even if there are outstanding requests. But that's the hardest to implement, because then it needs to allow freezing of tasks waiting on i_mutex, for example, which is currently not possible. But this is the only way I see that would not have unsolvable corner cases that crop up at the worst moment.
>
> And yes, it would be prudent to wait some time for pending requests to finish before freezing. But it's not a requirement, things *should* work without that: suspending a machine is just like a very long pause by the CPU, as long as no hardware is involved. And with fuse filesystems there is no hardware involved directly, by definition.
>
> But I'm open to ideas, and at this stage I think even patches that improve the situation for the majority of the cases would be acceptable, since this is turning out to be a major problem for a lot of people.
>
> Thanks,
> Miklos

For shutdown in userspace there is sendsigs.omit.d/ to avoid the problem of halting/killing processes of the fuse filesystems (or other services) prematurely. I guess something similar needs to be done for freeze. The fuse filesystem has to tell the kernel what is up.
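Miklos's ordering argument can be sketched as a toy simulation (illustrative only, not kernel code; all names are made up): a daemon with a request still in flight to an already-frozen daemon can never drain its queue, so freezing bottom-up wedges, while freezing the topmost filesystem first succeeds — but only if the dependency graph is known.

```python
# Toy model of the freeze-ordering problem: each FUSE daemon may forward
# requests to another mount. Freezing a daemon that another daemon still
# has a request in flight to wedges the whole freeze attempt.

def freeze_all(order, forwards, in_flight):
    """Freeze daemons in `order`. A daemon with a request in flight
    to an already-frozen daemon can never drain, so freezing fails."""
    frozen = set()
    for fs in order:
        # fs can only drain its queue if everything it forwards to
        # is still awake to answer.
        for dep in forwards.get(fs, []):
            if dep in frozen and in_flight.get(fs):
                return False          # stuck: freeze fails
        frozen.add(fs)
    return True

# /mnt/A forwards requests to /mnt/B; one request is in flight.
forwards = {"A": ["B"]}
in_flight = {"A": True}

print(freeze_all(["B", "A"], forwards, in_flight))  # bottom-up: False
print(freeze_all(["A", "B"], forwards, in_flight))  # topmost first: True
```

The point of the thread is exactly that `forwards` is unknown to the kernel, so the working order cannot be computed.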
MfG Goswin -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: error: Eeek! page_mapcount(page) went negative! (-1) with different process and kernels
Arnaud Fontaine <[EMAIL PROTECTED]> writes:
>> "Dave" == Dave Jones <[EMAIL PROTECTED]> writes:
>
> Dave> Many of these that I've seen have turned out to be a hardware
> Dave> problem. Try running memtest86+ on that machine for a while.
> Dave> It doesn't catch all problems, but it will highlight more
> Dave> common memory faults.
>
> Hello,
>
> We ran memtest86+ before production, about one month ago. Do you think it could come from that anyway?

I find that a lot of the time memtest does not reveal an error. Only when you combine multiple sources of load or do random access do you get errors. For example compiling a kernel while doing heavy I/O on the disk. But that might just be me. Errors are rather random occurrences.

Compiling a kernel repeatedly, and several compiles in parallel, is usually a good test. If it sometimes fails to compile then it is near certain to be a hardware error.

MfG
        Goswin
Re: [00/41] Large Blocksize Support V7 (adds memmap support)
Andrea Arcangeli <[EMAIL PROTECTED]> writes:
> On Mon, Sep 17, 2007 at 12:56:07AM +0200, Goswin von Brederlow wrote:
>> When has free ever given any useful "free" number? I can perfectly fine allocate another gigabyte of memory despite free saying 25MB. But that is because I know that the buffers/cache are not locked in.
>
> Well, as you said, you know that buffers/cache are not locked in. If /proc/meminfo were rubbish, like you seem to imply in the first line, why would we ever bother to export that information and even waste time writing a binary that parses it for admins?

As a user I know it because I didn't put a kernel source into /tmp. A program can't reasonably know that.

>> On the other hand 1GB can instantly vanish when I start a xen domain and anything relying on the free value would lose.
>
> Actually you'd better check meminfo or free before starting 1G of Xen!!

Xen has its own memory pool and can quite aggressively reclaim memory from dom0 when needed. I just meant to say that the number in /proc/meminfo can change in a second, so it is not much use knowing what it said last minute.

>> The only sensible thing for an application concerned with swapping is to watch the swapping and then reduce itself. Not the amount free. Although I wish there were some kernel interface to get a pressure value of how valuable free pages would be right now. I would like that for fuse, so a userspace filesystem can do caching without crippling the kernel.
>
> Repeated drop caches + free can help.

I would kill any program that does that to find out how much free ram the system has.

MfG
        Goswin
Re: [00/41] Large Blocksize Support V7 (adds memmap support)
[EMAIL PROTECTED] (Mel Gorman) writes:
> On (17/09/07 00:38), Goswin von Brederlow didst pronounce:
>> [EMAIL PROTECTED] (Mel Gorman) writes:
>> > On (15/09/07 02:31), Goswin von Brederlow didst pronounce:
>> >> Mel Gorman <[EMAIL PROTECTED]> writes:
>> >> Looking at my little test program, evicting movable objects from a mixed group should not be that expensive as it doesn't happen often.
>> >
>> > It happens regularly if the size of the block you need to keep clean is lower than min_free_kbytes. In the case of hugepages, that was always the case.
>>
>> That assumes that the number of groups allocated for unmovable objects will continuously grow and shrink.
>
> They do grow and shrink. The number of pagetables in use changes for example.

By a number of groups' worth? And do full groups get freed, unmixed and filled by movable objects?

>> I'm assuming it will level off at some size for long times (hours) under normal operations.
>
> It doesn't, unless you assume the system remains in a steady state for its lifetime. Things like updatedb tend to throw a spanner into the works.

Moved to cron weekly here. And even normally it is only once a day. So what if it starts moving some pages while updatedb runs? If it isn't too braindead it will reclaim some dentries updatedb has created and left for good. It should just cause the dentry cache to be smaller, at no cost. I'm not calling that normal operations. That is a once-a-day special.

What I don't want is to spend 1% of cpu time copying pages. That would be unacceptable. Copying 1000 pages per updatedb run would be trivial on the other hand.

>> There should be some buffering of a few groups to be held back in reserve when it shrinks, to prevent the scenario that the size is just at a group boundary and always grows/shrinks by 1 group.
>
> And what size should this group be so that all workloads function?

1 is enough to prevent jittering. If you don't hold a group back and you are exactly at a group boundary, then alternately allocating and freeing one page would result in a group allocation and freeing every time. With one group in reserve you only get a group allocation or freeing when a group's worth of change has happened.

This assumes that changing the type and/or state of a group is expensive. Takes time or locks or some such. Otherwise just let it jitter.

>> >> So if you evict movable objects from a mixed group when needed, all the pagetable pages would end up in the same mixed group, slowly taking it over completely. No fragmentation at all. See how essential that feature is. :)
>> >
>> > To move pages, there must be enough blocks free. That is where min_free_kbytes had to come in. If you cared only about keeping 64KB chunks free, it makes sense, but it didn't in the context of hugepages.
>>
>> I'm more concerned with keeping the little unmovable things out of the way. Those are the things that will fragment the memory and prevent any huge pages from being available, even with moving other stuff out of the way.
>
> That's fair, just not cheap.

That is the price you pay. To allocate 2MB of ram you have to have 2MB of free ram or make them free. There is no way around that. Moving pages means that you can actually get those 2MB even if the price is high, and that you have more choice deciding what to throw away or swap out. I would rather have a 2MB malloc take some time than have it fail because the kernel doesn't feel like it.

>> Can you tell me how? I would like to do the same.
>
> They were generated using the trace_allocmap kernel module in http://www.csn.ul.ie/~mel/projects/vmregress/vmregress-0.81-rc2.tar.gz in combination with frag-display in the same package. However, in the current version against current -mm's, it'll identify some movable pages wrong. Specifically, it will appear to be mixing movable pages with slab pages, and it doesn't identify SLUB pages properly at all (SLUB came after the last revision of this tool). I need to bring an annotation patch up to date before it can generate the images correctly.

Thanks. I will test that out and see what I get on a few lustre servers and clients. That is probably quite a different workload from what you test.

MfG
        Goswin
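The one-group-in-reserve hysteresis argument above can be put into a toy simulation (group size and event counts are made up for illustration; this is not allocator code): sitting exactly on a group boundary, alternately allocating and freeing one page triggers a group-level event on every swing, while a single reserve group absorbs the jitter entirely.

```python
# Toy model of group-boundary jitter with and without a reserve group.
PAGES_PER_GROUP = 512

def group_events(deltas, reserve_groups):
    """Count group allocations/frees for a stream of +/-1 page deltas,
    starting exactly on a group boundary."""
    pages = PAGES_PER_GROUP
    groups = 1 + reserve_groups          # groups currently dedicated
    events = 0
    for d in deltas:
        pages += d
        need = -(-pages // PAGES_PER_GROUP)      # ceil division
        if need > groups:                        # must take a new group
            groups = need
            events += 1
        elif need < groups - reserve_groups:     # give one back
            groups = need + reserve_groups
            events += 1
    return events

jitter = [+1, -1] * 1000   # allocate one page, free it, repeat
print(group_events(jitter, reserve_groups=0))  # an event on every swing
print(group_events(jitter, reserve_groups=1))  # none at all
```

With no reserve, 1000 allocate/free pairs cause 2000 group transitions; with one group held back, none, at the cost of at most one group's worth of withheld memory.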
Re: [00/41] Large Blocksize Support V7 (adds memmap support)
[EMAIL PROTECTED] (Mel Gorman) writes:
> On (16/09/07 23:58), Goswin von Brederlow didst pronounce:
>> But when you already have say 10% of the ram in mixed groups then it is a sign that external fragmentation is happening and some time should be spent on moving movable objects.
>
> I'll play around with it on the side and see what sort of results I get. I won't be pushing anything any time soon in relation to this though. For now, I don't intend to fiddle more with grouping pages by mobility for something that may or may not be of benefit to a feature that hasn't been widely tested with what exists today.

I watched the videos you posted. A nice and quite clear improvement with and without your logic. Kudos.

When you play around with it, may I suggest a change to the display of the memory information? I think it would be valuable to use a Hilbert curve to arrange the pages into pixels. Like this (numbers give the position of each page along the order-1 and order-2 curves):

0 3    0 1 E F
1 2    3 2 D C
       4 7 8 B
       5 6 9 A

+-----------+-----------+
|00 03 04 05|3A 3B 3C 3F|
|01 02 07 06|39 38 3D 3E|
|0E 0D 08 09|36 37 32 31|
|0F 0C 0B 0A|35 34 33 30|
+-----+-----+           |
|10 11|1E 1F|20 21 2E 2F|
|13 12|1D 1C|23 22 2D 2C|
|     +-----+           |
|14 17|18 1B|24 27 28 2B|
|15 16|19 1A|25 26 29 2A|
+-----+-----+-----------+

I've drawn in allocations for 16, 8, 4, 5, 32 pages, in that order, in the last one. The idea is to get near pages visually near in the output and into an area instead of lines. Easier on the eye. It also manages to always draw aligned order(x) blocks as squares or rectangles (even or odd order).

>> Maybe instead of reserving one could say that you can have up to 6 groups of space
>
> And if the groups are 1GB in size? I tried something like this already. It didn't work out well at the time although I could revisit.
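The index-to-pixel mapping behind the grids above is the standard iterative Hilbert-curve construction. A sketch (a generic implementation, not the visualization tool's actual code; coordinates are (column, row) with the row growing downwards so they match the grids):

```python
def d2xy(n, d):
    """Map index d along a Hilbert curve to (x, y) on an n x n grid,
    n a power of two. Classic iterative quadrant-rotation construction."""
    x = y = 0
    t = d
    s = 1
    while s < n:
        rx = 1 & (t // 2)
        ry = 1 & (t ^ rx)
        if ry == 0:              # rotate the quadrant
            if rx == 1:
                x = s - 1 - x
                y = s - 1 - y
            x, y = y, x
        x += s * rx
        y += s * ry
        t //= 4
        s *= 2
    return x, y

# Order-1 curve: indices 0..3 land at the corners in the 0/1/2/3 pattern
# shown above (0 top-left, 1 below it, 2 bottom-right, 3 top-right).
print([d2xy(2, d) for d in range(4)])
```

Successive indices always land on adjacent pixels, which is what keeps physically near pages visually near.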
You adjust group size with the total number of groups. You would not use 1GB huge pages on a 2GB ram system. You could try 2MB groups. I think for most current systems we are lucky there. 2MB groups fit hardware support and give a large but not too large number of groups to work with.

But you only need to stick to hardware-suitable group sizes for huge tlb support, right? For better I/O and such you could have 512KB groups if that size gives a reasonable total number of groups.

>> not used by unmovable objects before aggressive moving starts. I don't quite see why you NEED reserving as long as there is enough space free altogether in case something needs moving.
>
> Hence, increase min_free_kbytes.

Which is different from reserving a full group, as it does not count fragmented space as lost.

>> 1 group worth of space free might be plenty to move stuff to. Note that all the virtual pages can be stuffed in every little free space there is and reassembled by the MMU. There is no space lost there.
>
> What you suggest sounds similar to having a type MIGRATE_MIXED where you allocate from when the preferred lists are full. It became a sizing problem that never really worked out. As I said, I can try again.

Not really. I'm saying we should actively defragment mixed groups during allocation, and always as little as possible, when a certain level of external fragmentation is reached. A MIGRATE_MIXED sounds like giving up completely if things get bad enough. Compare it to a cheap network switch going into hub mode when its arp table runs full. If you ever had that then you know how bad that is.

>> But until one tries one can't say.
>>
>> MfG
>> Goswin
>>
>> PS: How do allocations pick groups?
>
> Using GFP flags to identify the type.

That is the type of group, not which one.

>> Could one use the oldest group dedicated to each MIGRATE_TYPE?
>
> Age is difficult to determine so probably not.

Put the uptime as a sort key into each group header on creation or type change. Then sort the partially used groups by that key. A heap will do fine and be fast.

>> Or lowest address for unmovable and highest address for movable? Something to better keep the two out of each other's way.
>
> We bias the location of unmovable and reclaimable allocations already. It's not done for movable because it wasn't necessary (as they are easily reclaimed or moved anyway).
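The heap-keyed-by-age suggestion above can be sketched in a few lines (class and method names are hypothetical, invented for illustration): stamp each group with a monotonic sequence number when it is created or changes type, and keep one min-heap per migrate type so allocation always picks the oldest group first.

```python
import heapq
import itertools

_stamp = itertools.count()   # stands in for "uptime at creation/type change"

class GroupPool:
    """Partially used page groups, one min-heap per migrate type,
    ordered by age stamp. Toy model, not allocator code."""
    def __init__(self):
        self.heaps = {}                         # migrate type -> heap

    def add_group(self, mtype, group_id):
        heap = self.heaps.setdefault(mtype, [])
        heapq.heappush(heap, (next(_stamp), group_id))

    def pick_oldest(self, mtype):
        # O(log n) pop of the oldest partially used group of this type.
        return heapq.heappop(self.heaps[mtype])[1]

pool = GroupPool()
pool.add_group("unmovable", "g7")
pool.add_group("unmovable", "g3")
pool.add_group("movable", "g9")
print(pool.pick_oldest("unmovable"))   # g7, added first
```

The heap keeps both operations logarithmic, so "age" need not be difficult to determine if it is recorded up front.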
Re: [00/41] Large Blocksize Support V7 (adds memmap support)
[EMAIL PROTECTED] (Mel Gorman) writes:
> On (16/09/07 23:31), Andrea Arcangeli didst pronounce:
>> On Sun, Sep 16, 2007 at 09:54:18PM +0100, Mel Gorman wrote:
>> Allocating ptes from slab is fairly simple but I think it would be better to allocate ptes in PAGE_SIZE (64k) chunks and preallocate the nearby ptes in the per-task local pagetable tree, to reduce the number of locks taken and not to enter the slab at all for that.
>
> It runs the risk of pinning up to 60K of data per task that is unusable for any other purpose. On average, it'll be more like 32K but worth keeping in mind.

Two things, to both of you respectively.

Why should we try to stay out of the pte slab? Isn't the slab exactly made for this thing? To efficiently handle a large number of equal-size objects for quick allocation and deallocation?

If it is a locking problem then there should be a per-cpu cache of ptes. Say 0-32 ptes. If you run out you allocate 16 from slab. When you overflow you free 16 (which would give you your 64k allocations, but in multiple objects).

As for the wastage: every pte can map 2MB on amd64, 4MB on i386, 8MB on sparc (?). A 64k pte chunk would be 32MB, 64MB and 32MB (?) respectively. For the sbrk() and mmap() usage from glibc malloc() that would be fine, as they grow linearly and the mmap() call in glibc could be made to align to those chunks. But for a program like rtorrent, using mmap to bring in chunks of a 4GB file, this looks disastrous.

>> In fact we could allocate the 4 levels (or anyway more than one level) in one single alloc_pages(0) and track the leftovers in the mm (or similar).

Personally I would really go with a per-cpu cache. When mapping a page, reserve 4 tables. Then you walk the tree and add entries as needed. And last you release the 0-4 unused entries to the cache.
MfG
        Goswin
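The batched per-cpu cache proposed above (up to 32 cached ptes, refill 16 from slab on empty, spill 16 back on overflow) looks like this as a toy model — the class, the backing "slab" list and all sizes are illustrative, not kernel data structures:

```python
CACHE_MAX = 32
BATCH = 16

class PteCache:
    """Per-cpu free-object cache in front of a slab allocator (toy)."""
    def __init__(self, slab):
        self.slab = slab            # backing allocator, here just a list
        self.cache = []

    def alloc(self):
        if not self.cache:
            # one trip to the slab buys a whole batch
            for _ in range(BATCH):
                self.cache.append(self.slab.pop())
        return self.cache.pop()

    def free(self, pte):
        self.cache.append(pte)
        if len(self.cache) > CACHE_MAX:
            # spill a batch back so the cache stays bounded
            for _ in range(BATCH):
                self.slab.append(self.cache.pop())

slab = list(range(100))
c = PteCache(slab)
ptes = [c.alloc() for _ in range(4)]   # one slab refill, then 3 cached hits
print(len(c.cache))                     # 12 objects left in the cache
```

Only one in sixteen allocations touches the shared slab (and its lock); the rest are satisfied locally, which is the point of the scheme.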
Re: [00/41] Large Blocksize Support V7 (adds memmap support)
Andrea Arcangeli <[EMAIL PROTECTED]> writes:
> You ignore one other bit: when "/usr/bin/free" says 1G is free, with config-page-shift it's free no matter what, and the same goes for not-mlocked cache. With variable order page cache, /usr/bin/free becomes mostly a lie as long as there's no 4k fallback (like fsblock).

% free
             total       used       free     shared    buffers     cached
Mem:       1398784    1372956      25828          0     225224     321504
-/+ buffers/cache:     826228     572556
Swap:      1048568         20    1048548

When has free ever given any useful "free" number? I can perfectly fine allocate another gigabyte of memory despite free saying 25MB. But that is because I know that the buffers/cache are not locked in.

On the other hand 1GB can instantly vanish when I start a xen domain, and anything relying on the free value would lose.

The only sensible thing for an application concerned with swapping is to watch the swapping and then reduce itself. Not the amount free. Although I wish there were some kernel interface to get a pressure value of how valuable free pages would be right now. I would like that for fuse, so a userspace filesystem can do caching without crippling the kernel.

MfG
        Goswin
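Goswin's point — that "free" alone is misleading and reclaimable buffer/page cache has to be added in — is what the "-/+ buffers/cache" line of free(1) computes. A minimal sketch over a /proc/meminfo-style snapshot (the snapshot text below reuses the numbers from the free output above; a real program would read /proc/meminfo):

```python
# Estimate allocatable memory as MemFree + Buffers + Cached,
# i.e. the "-/+ buffers/cache: free" column of free(1).

SAMPLE = """\
MemTotal:      1398784 kB
MemFree:         25828 kB
Buffers:        225224 kB
Cached:         321504 kB
"""

def meminfo(text):
    info = {}
    for line in text.splitlines():
        key, rest = line.split(":")
        info[key] = int(rest.split()[0])   # value in kB
    return info

m = meminfo(SAMPLE)
naive = m["MemFree"]                                   # 25828 kB
effective = m["MemFree"] + m["Buffers"] + m["Cached"]  # 572556 kB
print(naive, effective)
```

Of course this only restates the thread's caveat: the estimate can be stale a second later (Xen ballooning, a new mlock), which is why Goswin wants a pressure interface rather than a snapshot.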
Re: [00/41] Large Blocksize Support V7 (adds memmap support)
[EMAIL PROTECTED] (Mel Gorman) writes:
> On (16/09/07 17:08), Andrea Arcangeli didst pronounce:
>> zooming in I see red pixels all over the squares mixed with green pixels in the same square. This is exactly what happens with the variable order page cache and that's why it provides zero guarantees in terms of how much ram is really "free" (free as in "available").
>
> This picture is not grouping pages by mobility, so that is hardly a surprise. This picture is not running grouping pages by mobility. This is what the normal kernel looks like. Look at the videos in http://www.skynet.ie/~mel/anti-frag/2007-02-28 and see how list-based compares to vanilla. These are from February, when there was less control over mixing blocks than there is today.
>
> In the current version mixing occurs in the lower blocks as much as possible, not the upper ones. So there are a number of mixed blocks but the number is kept to a minimum.
>
> The number of mixed blocks could have been enforced as 0, but I felt it was better in the general case to fragment rather than regress performance. That may be different for large blocks, where you will want to take the enforcement steps.

I agree that 0 is a bad value. But so is infinity. There should be some mixing but not a lot. You say "kept to a minimum". Is that actively done or does it already happen by itself? Hopefully the latter, which would be just splendid.

>> With config-page-shift mmap works on 4k chunks but it's always backed by 64k or any other large size that you chose at compile time. And if

But would mapping a random 4K page out of a file then consume 64k? That sounds like an awful lot of internal fragmentation. I hope the unaligned bits and pieces get put into a slab or something, as you suggested previously.

>> the virtual alignment of mmap matches the physical alignment of the physical largepage and is >= PAGE_SIZE (software PAGE_SIZE I mean) we could use the 62nd bit of the pte to use a 64k tlb (if future cpus will allow that). Nick also suggested to still set all ptes equal to make life easier for the tlb miss microcode.

It is too bad that existing amd64 CPUs only allow such large physical pages. But it kind of makes sense to cut away a full level of page tables for each next bigger size.

>> > big you can make it. I don't think my system with 1GB ram would work so well with 2MB order 0 pages. But I wasn't referring to that but to the picture.
>>
>> Sure! 2M is surely way excessive for a 1G system, 64k most certainly too, of course unless you're running a db or a multimedia streaming service, in which case it should be ideal.

rtorrent, Xemacs/gnus, bash, xterm, zsh, make, gcc, galeon and the occasional mplayer. I would mostly be concerned how rtorrent's totally random access of mmapped files negatively impacts such a 64k page system.

MfG
        Goswin
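The rtorrent worry above is simple arithmetic: if the page-cache unit is 64 KiB, each randomly touched 4 KiB chunk pins a whole 64 KiB block, a 16x blow-up. A back-of-envelope sketch (the chunk count is made up, and it assumes each touch lands in a distinct 64 KiB block, the worst case for random access into a large file):

```python
# Internal fragmentation of random 4 KiB reads under a 64 KiB cache unit.
KIB = 1024
MIB = 1024 * KIB

touched = 10_000                 # random 4 KiB chunks actually read

small = touched * 4 * KIB        # 4 KiB units resident
large = touched * 64 * KIB       # 64 KiB units resident, worst case

print(small // MIB, "MiB vs", large // MIB, "MiB")
```

About 39 MiB of useful data ends up holding roughly 625 MiB of page cache resident, which on a 1 GB machine is exactly the "disastrous" case mentioned for the pte chunks earlier in the thread.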
Re: [00/41] Large Blocksize Support V7 (adds memmap support)
Linus Torvalds <[EMAIL PROTECTED]> writes:
> On Sun, 16 Sep 2007, Jörn Engel wrote:
>>
>> My approach is to have one for mount points and ramfs/tmpfs/sysfs/etc. which are pinned for their entire lifetime and another for regular files/inodes. One could take a three-way approach and have always-pinned, often-pinned and rarely-pinned.
>>
>> We won't get never-pinned that way.
>
> That sounds pretty good. The problem, of course, is that most of the time, the actual dentry allocation itself is done before you really know which case the dentry will be in, and the natural place for actually giving the dentry lifetime hint is *not* at "d_alloc()", but when we "instantiate" it with d_add() or d_instantiate().
>
> But it turns out that most of the filesystems we care about already use a special case of "d_add()" that *already* replaces the dentry with another one in some cases: "d_splice_alias()".
>
> So I bet that if we just taught "d_splice_alias()" to look at the inode, and based on the inode just re-allocate the dentry to some other slab cache, we'd already handle a lot of the cases!
>
> And yes, you'd end up with the reallocation overhead quite often, but at least it would now happen only when filling in a dentry, not in the (*much* more critical) cached lookup path.
>
> Linus

You would only get it for dentries that live long (or your prediction is awfully wrong), and then the reallocation amortizes over time, if you will. :)

MfG
        Goswin
Re: [00/41] Large Blocksize Support V7 (adds memmap support)
[EMAIL PROTECTED] (Mel Gorman) writes:
> On (15/09/07 02:31), Goswin von Brederlow didst pronounce:
>> Mel Gorman <[EMAIL PROTECTED]> writes:
>> > On Fri, 2007-09-14 at 18:10 +0200, Goswin von Brederlow wrote:
>> >> Nick Piggin <[EMAIL PROTECTED]> writes:
>> >> > In my attack, I cause the kernel to allocate lots of unmovable allocations and deplete movable groups. I theoretically then only need to keep a small number (1/2^N) of these allocations around in order to DoS a page allocation of order N.
>> >>
>> >> I'm assuming that when an unmovable allocation hijacks a movable group, any further unmovable alloc will evict movable objects out of that group before hijacking another one. Right?
>> >
>> > No eviction takes place. If an unmovable allocation gets placed in a movable group, then steps are taken to ensure that future unmovable allocations will take place in the same range (these decisions take place in __rmqueue_fallback()). When choosing a movable block to pollute, it will also choose the lowest possible block in PFN terms to steal so that fragmentation pollution will be as confined as possible. Evicting the unmovable pages would be one of those expensive steps that have been avoided to date.
>>
>> But then you can have all blocks filled with movable data, free 4K in one group, allocate 4K unmovable to take over the group, free 4k in the next group, take that group and so on. You can end up with 4k unmovable in every 64k easily, by accident.
>
> As the mixing takes place at the lowest possible block, it's exceptionally difficult to trigger this. Possible, but exceptionally difficult.

Why is it difficult? When user space allocates memory, wouldn't it get it contiguously? I mean that is one of the goals: to use larger contiguous allocations and map them with a single page table entry where possible, right? And then you can roughly predict where an munmap() would free a page.

Say the application maps a few GB of file, uses madvise to tell the kernel it needs a 2MB block (to get a contiguous 2MB chunk mapped), waits for it and then munmaps 4K in there. A 4k hole for some unmovable object to fill. If you can then trigger the creation of an unmovable object as well (stat some file?) and loop, you will fill the ram quickly. Maybe it only works in 10% of cases, but then you just do it 10 times as often. Over long times it could occur naturally. This is just to demonstrate it with malice.

> As I have stated repeatedly, the guarantees can be made but potential hugepage allocation did not justify it. Large blocks might.
>
>> There should be a lot of pressure for movable objects to vacate a mixed group or you do get fragmentation catastrophes.
>
> We (Andy Whitcroft and I) did implement something like that. It hooked into kswapd to clean mixed blocks. If the caller could do the cleaning, it did the work instead of kswapd.

Do you have a graphic like http://www.skynet.ie/~mel/anti-frag/2007-02-28/page_type_distribution.jpg for that case?

>> Looking at my little test program, evicting movable objects from a mixed group should not be that expensive as it doesn't happen often.
>
> It happens regularly if the size of the block you need to keep clean is lower than min_free_kbytes. In the case of hugepages, that was always the case.

That assumes that the number of groups allocated for unmovable objects will continuously grow and shrink. I'm assuming it will level off at some size for long times (hours) under normal operations. There should be some buffering of a few groups to be held back in reserve when it shrinks, to prevent the scenario that the size is just at a group boundary and always grows/shrinks by 1 group.

>> The cost of it should be freeing some pages (or finding free ones in a movable group) and then memcpy.
>
> Freeing pages is not cheap.

Copying pages is cheaper but not cheap. To copy you need a free page as destination. That's all I meant. Hopefully there will always be a free one, and the actual freeing is done asynchronously from the copying.

>> So if you evict movable objects from a mixed group when needed, all the pagetable pages would end up in the same mixed group, slowly taking it over completely. No fragmentation at all. See how essential that feature is. :)
>
> To move pages, there must be enough blocks free. That is where min_free_kbytes had to come in. If you cared only about keeping 64KB chunks free, it makes sense, but it didn't in the context of hugepages.
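The one-4k-hole-per-group pattern described above is easy to see in a toy allocator model (group and page counts are made up; real pageblocks are much larger): free one movable page per group, let an unmovable allocation land in each hole, and every group ends up mixed, so without eviction no group can ever be compacted back into a free large block.

```python
# Simulate the accidental/malicious mixing pattern from the thread.
GROUPS, PAGES = 8, 16

groups = [{"movable": PAGES, "unmovable": 0} for _ in range(GROUPS)]

for g in groups:
    g["movable"] -= 1        # munmap() one 4k page somewhere in the group
    g["unmovable"] += 1      # an unmovable object (stat, pagetable, ...)
                             # is allocated into the hole

mixed = sum(1 for g in groups if g["unmovable"] and g["movable"])
print(mixed, "of", GROUPS, "groups are mixed")
```

One page of unmovable data per group is enough to poison all of them, which is exactly the 1/2^N DoS figure Nick quotes for order-N allocations.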
Re: [00/41] Large Blocksize Support V7 (adds memmap support)
Jörn Engel <[EMAIL PROTECTED]> writes:
> On Sun, 16 September 2007 00:30:32 +0200, Andrea Arcangeli wrote:
>>
>> Movable? I rather assume all slab allocations aren't movable. Then slab defrag can try to tackle users like dcache and inodes. Keep in mind that with the exception of updatedb, those inodes/dentries will be pinned and you won't move them, which is why I prefer to consider them not movable too... since there's no guarantee they are.
>
> I have been toying with the idea of having separate caches for pinned and movable dentries. Downside of such a patch would be the number of memcpy() operations when moving dentries from one cache to the other. Upside is that a fair amount of slab cache can be made movable. memcpy() is still faster than reading an object from disk.

How probable is it that the dentry is needed again? If you copy it and it is not needed then you wasted time. If you throw it out and it is needed then you wasted time too. Depending on the probability, one of the two is cheaper overall. Ideally I would throw away dentries that haven't been accessed recently and copy recently used ones.

How much of a system's ram is spent on dentries? How much on task structures? Does anyone have some stats on that? If it is <10% of the total ram combined then I don't see much point in moving them. Just keep them out of the way of users' memory so the buddy system can work effectively.

> Most likely the current reaction to such a patch would be to shoot it down due to overhead, so I didn't pursue it. All I have is an old patch to separate never-cached from possibly-cached dentries. It will increase the odds of freeing a slab, but provide no guarantee.
>
> But the point here is: dentries/inodes can be made movable if there are clear advantages to it. Maybe they should?
>
> Jörn

MfG
        Goswin
Re: [00/41] Large Blocksize Support V7 (adds memmap support)
[EMAIL PROTECTED] (Mel Gorman) writes: > On (15/09/07 14:14), Goswin von Brederlow didst pronounce: >> Andrew Morton <[EMAIL PROTECTED]> writes: >> >> > On Tue, 11 Sep 2007 14:12:26 +0200 Jörn Engel <[EMAIL PROTECTED]> wrote: >> > >> >> While I agree with your concern, those numbers are quite silly. The >> >> chances of 99.8% of pages being free and the remaining 0.2% being >> >> perfectly spread across all 2MB large_pages are lower than those of SHA1 >> >> creating a collision. >> > >> > Actually it'd be pretty easy to craft an application which allocates seven >> > pages for pagecache, then one for , then seven for pagecache, >> > then >> > one for , etc. >> > >> > I've had test apps which do that sort of thing accidentally. The result >> > wasn't pretty. >> >> Except that the applications 7 pages are movable and the >> would have to be unmovable. And then they should not share the same >> memory region. At least they should never be allowed to interleave in >> such a pattern on a larger scale. >> > > It is actually really easy to force regions to never share. At the > moment, there is a fallback list that determines a preference for what > block to mix. > > The reason why this isn't enforced is the cost of moving. On x86 and > x86_64, a block of interest is usually 2MB or 4MB. Clearing out one of > those pages to prevent any mixing would be bad enough. On PowerPC, it's > potentially 16MB. On IA64, it's 1GB. > > As this was fragmentation avoidance, not guarantees, the decision was > made to not strictly enforce the types of pages within a block as the > cost cannot be made back unless the system was making agressive use of > large pages. This is not the case with Linux. I don't say the group should never be mixed. The movable objects could be moved out on demand. If 64k get allocated then up to 64k get moved. That would reduce the impact as the kernel does not hang while it moves 2MB or even 1GB. 
It also allows objects to be freed and the space reused in the unmovable and mixed groups. There could also be a certain number or percentage of mixed groups allowed to further increase the chance of movable objects freeing themselves from mixed groups. But when you already have say 10% of the ram in mixed groups then it is a sign that external fragmentation is happening and some time should be spent on moving movable objects. >> The only way a fragmentation catastrophe can be (provably) avoided is >> by having so few unmovable objects that size + max waste << ram >> size. The smaller the better. Allowing movable and unmovable objects >> to mix means that max waste goes way up. In your example waste would >> be 7*size. With a 2MB upper order limit it would be 511*size. >> >> I keep coming back to the fact that movable objects should be moved >> out of the way for unmovable ones. Anything else just allows >> fragmentation to build up. >> > > This is easily achieved, just really really expensive because of the > amount of copying that would have to take place. It would also compel > that min_free_kbytes be at least one free PAGEBLOCK_NR_PAGES and likely > MIGRATE_TYPES * PAGEBLOCK_NR_PAGES to reduce excessive copying. That is > a lot of free memory to keep around which is why fragmentation avoidance > doesn't do it. In your sample graphics you had 1152 groups. Reserving a few of those doesn't sound too bad. And how many migrate types are we talking about? So far we only had movable and unmovable. I would split unmovable into short term (caches, I/O pages) and long term (task structures, dentries). Reserving 6 groups for short term unmovable and long term unmovable would be 1% of ram in your situation. Maybe instead of reserving one could say that you can have up to 6 groups of space not used by unmovable objects before aggressive moving starts. I don't quite see why you NEED reserving as long as there is enough space free altogether in case something needs moving. 
1 group worth of space free might be plenty to move stuff to. Note that all the virtual pages can be stuffed in every little free space there is and reassembled by the MMU. There is no space lost there. But until one tries one can't say. MfG Goswin PS: How do allocations pick groups? Could one use the oldest group dedicated to each MIGRATE_TYPE? Or lowest address for unmovable and highest address for movable? Something to better keep the two out of each other's way.
Re: [00/41] Large Blocksize Support V7 (adds memmap support)
Andrea Arcangeli <[EMAIL PROTECTED]> writes: > On Sat, Sep 15, 2007 at 10:14:44PM +0200, Goswin von Brederlow wrote: >> - Userspace allocates a lot of memory in those slabs. > > If with slabs you mean slab/slub, I can't follow, there has never been > a single byte of userland memory allocated there ever since the slab > existed in Linux. This and other comments in your reply show me that you completely misunderstood what I was talking about. Look at http://www.skynet.ie/~mel/anti-frag/2007-02-28/page_type_distribution.jpg The red dots (pinned) are dentries, page tables, kernel stacks, whatever kernel stuff, right? The green dots (movable) are mostly userspace pages being mapped there, right? What I was referring to is that because movable objects (green dots) aren't moved out of a mixed group (the boxes) when some unmovable object needs space, all the groups become mixed over time. That means the unmovable objects are spread out over all the ram and the buddy system can't recombine regions when unmovable objects free them. There will nearly always be some movable objects in the other buddy. The system of having unmovable and movable groups breaks down and becomes useless. I'm assuming here that we want the possibility of larger order pages for unmovable objects (large continuous regions for DMA for example) than the smallest order user space gets (or any movable object). If mmap() still works on 4k page boundaries then those will fragment all regions into 4k chunks in the worst case. Obviously if userspace has a minimum order of 64k chunks then it will never break any region smaller than 64k chunks and will never cause a fragmentation catastrophe. I know that is very roughly your approach (make order 0 bigger), and I like it, but it has some limits as to how big you can make it. I don't think my system with 1GB ram would work so well with 2MB order 0 pages. But I wasn't referring to that but to the picture. 
MfG Goswin
Re: [00/41] Large Blocksize Support V7 (adds memmap support)
Andrea Arcangeli <[EMAIL PROTECTED]> writes: > On Sat, Sep 15, 2007 at 02:14:42PM +0200, Goswin von Brederlow wrote: >> I keep coming back to the fact that movable objects should be moved >> out of the way for unmovable ones. Anything else just allows > > That's incidentally exactly what the slab does, no need to reinvent > the wheel for that, it's an old problem and there's room for > optimization in the slab partial-reuse logic too. Just boost the order > 0 page size and use the slab to get the 4k chunks. The sgi/defrag > design is backwards. How does that help? Will slabs move objects around to combine two partially filled slabs into a nearly full one? If not consider this: - You create a slab for 4k objects based on 64k compound pages. (first of all that already wastes a page for the meta info) - Something movable allocates 14 4k pages in there, making the slab partially filled. - Something unmovable allocates a 4k page, making the slab mixed and full. - Repeat until out of memory. OR - Userspace allocates a lot of memory in those slabs. - Userspace frees one in every 15 4k chunks. - Userspace forks 1000 times causing an unmovable task structure to appear in 1000 slabs. MfG Goswin
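The two scenarios above can be put in a few lines of toy code (my own illustration, not slab code; the 15-usable-slots figure follows the mail's assumption that one page per 64k slab goes to metadata):

```python
# Toy model (not kernel code): 64k slabs of 4k objects, 15 usable slots
# each (one page assumed lost to metadata, as in the mail).
def fill_mixed_slabs(n_slabs):
    # scenario 1: 14 movable objects plus 1 unmovable object per slab
    return [{"movable": 14, "unmovable": 1} for _ in range(n_slabs)]

def free_all_movable(slabs):
    for s in slabs:
        s["movable"] = 0
    return slabs

slabs = free_all_movable(fill_mixed_slabs(1000))
# Every slab still holds one unmovable object, so no 64k compound page
# can be handed back even though 14/15 of each slab is now free.
reclaimable = sum(1 for s in slabs if s["unmovable"] == 0)
```

Unless the slab actively migrates objects between partial slabs, `reclaimable` stays at zero here no matter how much movable memory is freed.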
Re: [00/41] Large Blocksize Support V7 (adds memmap support)
Andrew Morton <[EMAIL PROTECTED]> writes: > On Tue, 11 Sep 2007 14:12:26 +0200 Jörn Engel <[EMAIL PROTECTED]> wrote: > >> While I agree with your concern, those numbers are quite silly. The >> chances of 99.8% of pages being free and the remaining 0.2% being >> perfectly spread across all 2MB large_pages are lower than those of SHA1 >> creating a collision. > > Actually it'd be pretty easy to craft an application which allocates seven > pages for pagecache, then one for , then seven for pagecache, then > one for , etc. > > I've had test apps which do that sort of thing accidentally. The result > wasn't pretty. Except that the application's 7 pages are movable and the would have to be unmovable. And then they should not share the same memory region. At least they should never be allowed to interleave in such a pattern on a larger scale. The only way a fragmentation catastrophe can be (provably) avoided is by having so few unmovable objects that size + max waste << ram size. The smaller the better. Allowing movable and unmovable objects to mix means that max waste goes way up. In your example waste would be 7*size. With a 2MB upper order limit it would be 511*size. I keep coming back to the fact that movable objects should be moved out of the way for unmovable ones. Anything else just allows fragmentation to build up. MfG Goswin
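The 7*size and 511*size figures quoted above check out; a quick sketch of the arithmetic (my own illustration, assuming 4k base pages):

```python
# One pinned unmovable page blocks the rest of its max-order block,
# so the worst-case waste factor is (block_size / page_size) - 1.
PAGE = 4 * 1024

def max_waste_per_pinned_page(block_bytes):
    """Pages wasted per pinned page when one unmovable page blocks a block."""
    return block_bytes // PAGE - 1

seven = max_waste_per_pinned_page(8 * PAGE)         # the 7-pages-then-one example
upper = max_waste_per_pinned_page(2 * 1024 * 1024)  # 2MB upper order limit
```

With an 8-page (32k) block the factor is 7; with a 2MB block it is 511, matching the mail.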
Re: [00/41] Large Blocksize Support V7 (adds memmap support)
Christoph Lameter <[EMAIL PROTECTED]> writes: > On Fri, 14 Sep 2007, Christoph Lameter wrote: > >> an -ENOMEM. Given the quantities of pages on todays machine--a 1 G machine > > s/1G/1T/ Sigh. > >> has 256 million 4k pages--and the unmovable ratios we see today it > > 256k for 1G. 256k == 64 pages for 1GB ram or 256k pages == 1Mb? MfG Goswin
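For what it's worth, the arithmetic both mails are reaching for (my own check, reading "256 million" in binary units, i.e. 256 * 2^20):

```python
# 1GB of ram holds 256k 4k pages; the 256-million figure only works
# for the corrected 1TB machine.
GB = 1 << 30
TB = 1 << 40

pages_per_gb = GB // (4 << 10)  # 4k pages in 1GB
pages_per_tb = TB // (4 << 10)  # 4k pages in 1TB
```

So "256k" here is a count of pages per 1GB, and the same division at 1TB gives the 256 million 4k pages from the original mail.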
Re: [00/41] Large Blocksize Support V7 (adds memmap support)
Mel Gorman <[EMAIL PROTECTED]> writes: > On Fri, 2007-09-14 at 18:10 +0200, Goswin von Brederlow wrote: >> Nick Piggin <[EMAIL PROTECTED]> writes: >> >> > In my attack, I cause the kernel to allocate lots of unmovable allocations >> > and deplete movable groups. I theoretically then only need to keep a >> > small number (1/2^N) of these allocations around in order to DoS a >> > page allocation of order N. >> >> I'm assuming that when an unmovable allocation hijacks a movable group >> any further unmovable alloc will evict movable objects out of that >> group before hijacking another one. right? >> > > No eviction takes place. If an unmovable allocation gets placed in a > movable group, then steps are taken to ensure that future unmovable > allocations will take place in the same range (these decisions take > place in __rmqueue_fallback()). When choosing a movable block to > pollute, it will also choose the lowest possible block in PFN terms to > steal so that fragmentation pollution will be as confined as possible. > Evicting the unmovable pages would be one of those expensive steps that > have been avoided to date. But then you can have all blocks filled with movable data, free 4K in one group, allocate 4K unmovable to take over the group, free 4k in the next group, take that group and so on. You can end up with 4k unmovable in every 64k easily by accident. There should be a lot of pressure for movable objects to vacate a mixed group or you do get fragmentation catastrophes. Looking at my little test program, evicting movable objects from a mixed group should not be that expensive as it doesn't happen often. The cost of it should be freeing some pages (or finding free ones in a movable group) and then memcpy. With my simplified simulation it never happens so I expect it to only happen when the working set changes. >> > And it doesn't even have to be a DoS. 
The natural fragmentation >> > that occurs in a kernel today has the possibility to slowly push out >> > the movable groups and give you the same situation. >> >> How would you cause that? Say you do want to purposefully place one >> unmovable 4k page into every 64k compound page. So you allocate >> 4K. First 64k page locked. But now, to get 4K into the second 64K page >> you have to first use up all the rest of the first 64k page. Meaning >> one 4k chunk, one 8k chunk, one 16k chunk, one 32k chunk. Only then >> will a new 64k chunk be broken and become locked. > > It would be easier early in the boot to mmap a large area and fault it > in in virtual address order then mlock a page every 64K. Early in > the system's lifetime, there will be a rough correlation between physical > and virtual memory. > > Without mlock(), the most successful attack will likely mmap() a 60K > region and fault it in as an attempt to get pagetable pages placed in > every 64K region. This strategy would not work with grouping pages by > mobility though as it would group the pagetable pages together. But even with mlock the virtual pages should still be movable. So if you evict movable objects from a mixed group when needed all the pagetable pages would end up in the same mixed group slowly taking it over completely. No fragmentation at all. See how essential that feature is. :) > Targeted attacks on grouping pages by mobility are not very easy and > not that interesting either. As Nick suggests, the natural fragmentation > over long periods of time is what is interesting. > >> So to get the last 64k chunk used all previous 32k chunks need to be >> blocked and you need to allocate 32k (or less if more is blocked). For >> all previous 32k chunks to be blocked every second 16k needs to be >> blocked. To block the last of those 16k chunks all previous 8k chunks >> need to be blocked and you need to allocate 8k. 
For all previous 8k >> chunks to be blocked every second 4k page needs to be used. To alloc >> the last of those 4k pages all previous 4k pages need to be used. >> >> So to construct a situation where no continuous 64k chunk is free you >> have to allocate - 64k - 32k - 16k - 8k - 4k (or >> thereabouts) of memory first. Only then could you free memory again while >> still keeping every 64k page blocked. Does that occur naturally given >> enough ram to start with? >> > > I believe it's very difficult to craft an attack that will work in a > short period of time. An attack that worked on 2.6.22 as well may have > no success on 2.6.23-rc4-mm1 for example as grouping pages by mobility > makes it exceedingly hard to craft an attack unless the attacker > can mlock large amounts of memory.
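The takeover-by-accident scenario from earlier in this mail can be modelled in a few lines (a toy illustration of mine, with 16 4k pages per 64k group):

```python
# All groups start full of movable data; freeing one 4k page per group
# and letting an unmovable allocation fall into the hole mixes every
# group, with no attack needed.
GROUP_PAGES = 16  # 16 x 4k = one 64k group

def accidental_takeover(n_groups):
    groups = [{"movable": GROUP_PAGES, "unmovable": 0}
              for _ in range(n_groups)]
    for g in groups:
        g["movable"] -= 1    # something frees 4k in this group...
        g["unmovable"] += 1  # ...and an unmovable alloc fills the hole
    return sum(1 for g in groups if g["unmovable"] > 0)  # mixed groups

mixed = accidental_takeover(100)
```

Without eviction pressure every group ends up mixed, which is exactly the "4k unmovable in every 64k" situation described.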
Re: [RFC PATCH] Add a 'minimal tree install' target
Chris Wedgwood <[EMAIL PROTECTED]> writes: > This is a somewhat rough first-pass at making a 'minimal tree' > installation target. This installs a partial source-tree which you > can use to build external modules against. It feels pretty unclean > but I'm not aware of a much better way to do some of this. > > This patch works for me, even when using O=. It probably > needs further cleanups. > > Comments? Ever looked at the debian packages and how they do it? They even split out common files and specific files from the kernel build. Saves some space if you build multiple flavours of the same kernel version. MfG Goswin
Re: sata & scsi suggestion for make menuconfig
Helge Hafting <[EMAIL PROTECTED]> writes: > Randy Dunlap wrote: >> On Fri, 7 Sep 2007 14:48:00 +0200 Folkert van Heusden wrote: >> >> >>> Hi, >>> >>> Maybe it is a nice enhancement for make menuconfig to more explicitly >>> give a pop-up or so when someone selects for example a sata controller >>> while no 'scsi-disk' support was selected? >>> >> >> I know that it's difficult to get people to read docs & help text, >> and maybe it is needed in more places, but CONFIG_ATA (SATA/PATA) >> help text says: >> >> NOTE: ATA enables basic SCSI support; *however*, >> 'SCSI disk support', 'SCSI tape support', or >> 'SCSI CDROM support' may also be needed, >> depending on your hardware configuration. Could one duplicate the configure options for scsi disk/tape/cdrom at that place? The text should then probably read SCSI/SATA disk support in both places. MfG Goswin
Re: O_NOLINK for open()
Brent Casavant <[EMAIL PROTECTED]> writes: > My (limited) understanding of ptrace is that a parent-child > relationship is needed between the tracing process and the traced > process (at least that's what I gather from the man page). This > does give cause for concern, and I might have to see what can be > done to alleviate this concern. I fully realize that making this > design completely unassailable is a fool's errand, but closing off > as many attack vectors as possible seems prudent. No relationship needed: strace -p MfG Goswin
Re: [00/41] Large Blocksize Support V7 (adds memmap support)
Hi, Nick Piggin <[EMAIL PROTECTED]> writes: > In my attack, I cause the kernel to allocate lots of unmovable allocations > and deplete movable groups. I theoretically then only need to keep a > small number (1/2^N) of these allocations around in order to DoS a > page allocation of order N. I'm assuming that when an unmovable allocation hijacks a movable group any further unmovable alloc will evict movable objects out of that group before hijacking another one. right? > And it doesn't even have to be a DoS. The natural fragmentation > that occurs in a kernel today has the possibility to slowly push out > the movable groups and give you the same situation. How would you cause that? Say you do want to purposefully place one unmovable 4k page into every 64k compound page. So you allocate 4K. First 64k page locked. But now, to get 4K into the second 64K page you have to first use up all the rest of the first 64k page. Meaning one 4k chunk, one 8k chunk, one 16k chunk, one 32k chunk. Only then will a new 64k chunk be broken and become locked. So to get the last 64k chunk used all previous 32k chunks need to be blocked and you need to allocate 32k (or less if more is blocked). For all previous 32k chunks to be blocked every second 16k needs to be blocked. To block the last of those 16k chunks all previous 8k chunks need to be blocked and you need to allocate 8k. For all previous 8k chunks to be blocked every second 4k page needs to be used. To alloc the last of those 4k pages all previous 4k pages need to be used. So to construct a situation where no continuous 64k chunk is free you have to allocate - 64k - 32k - 16k - 8k - 4k (or thereabouts) of memory first. Only then could you free memory again while still keeping every 64k page blocked. Does that occur naturally given enough ram to start with? 
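A rough sketch of that construction cost (my own illustration; I am assuming the elided quantity before the subtractions is total RAM, which the mail leaves implicit):

```python
# Cost of blocking every 64k chunk per the argument above: first touch
# almost everything, then only one 4k page per 64k chunk must stay pinned.
KB = 1024

def alloc_before_all_64k_blocked(ram_bytes):
    # everything except one free 64k, 32k, 16k, 8k and 4k chunk
    return ram_bytes - (64 + 32 + 16 + 8 + 4) * KB

def pinned_to_keep_blocked(ram_bytes):
    # afterwards one 4k page per 64k chunk keeps every chunk blocked
    return (ram_bytes // (64 * KB)) * (4 * KB)
```

With 1GB of RAM, nearly the whole gigabyte has to be allocated first, but only 64MB (1/16 of it) needs to remain allocated afterwards, which is why the question whether this occurs naturally matters.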
To see how bad fragmentation could be I wrote a little program to simulate allocations with the following simplified algorithm: Memory management: - Free pages are kept in buckets, one per order, and sorted by address. - alloc() takes the front page (smallest address) out of the bucket of the right order or recursively splits the next higher bucket. - free() recursively tries to merge a page with its neighbour and puts the result back into the proper bucket (sorted by address). Allocation and lifetime: - Every tick a new page is allocated with random order. - The order is a triangle distribution with max at 0 (throw 2 dice, add the eyes, subtract 7, abs() the number). - The page is scheduled to be freed after X ticks. Where X is nearly a Gauss curve centered at 0 and maximum at * 1.5. (What I actually do is throw 8 dice and sum them up and shift the result.) Display: I start with a white window. Every page allocation draws a black box from the address of the page and as wide as the page is big (-1 pixel to give a separation to the next page). Every page free draws a yellow box in place of the black one. Yellow to show where a page was in use at one point while white means the page was never used. As time ticks the memory fills up. Quickly at first and then it comes to a stop around 80% filled. And then something interesting happens. The yellow regions (previously used but now free) start drifting up. Small pages tend to end up in the lower addresses and big pages at the higher addresses. The memory defragments itself to some degree. 
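The allocator half of that simulation can be sketched like this (a condensed re-implementation from the description above, not the original program; orders 0..4 stand for the 4k..64k sizes in the mail):

```python
import bisect

MAX_ORDER = 4  # order 0 = 4k page, order 4 = 64k block

class Buddy:
    """Buddy-style buckets per order, lowest address first, with merging."""
    def __init__(self, total_pages):
        self.free = {o: [] for o in range(MAX_ORDER + 1)}
        for addr in range(0, total_pages, 1 << MAX_ORDER):
            self.free[MAX_ORDER].append(addr)

    def alloc(self, order):
        # take the front (smallest-address) block, splitting higher orders
        for o in range(order, MAX_ORDER + 1):
            if self.free[o]:
                addr = self.free[o].pop(0)
                while o > order:
                    o -= 1
                    bisect.insort(self.free[o], addr + (1 << o))
                return addr
        return None  # out of memory

    def free_block(self, addr, order):
        # recursively merge with the buddy while it is also free
        while order < MAX_ORDER:
            buddy = addr ^ (1 << order)
            if buddy in self.free[order]:
                self.free[order].remove(buddy)
                addr &= ~(1 << order)
                order += 1
            else:
                break
        bisect.insort(self.free[order], addr)
```

Driving this with the dice-based order and lifetime distributions described above is what produces the drift of small pages toward low addresses.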
http://mrvn.homeip.net/fragment/ Simulating 256MB ram and after 1472943 ticks and 530095 4k, 411841 8k, 295296 16k, 176647 32k and 59064 64k allocations you get this: http://mrvn.homeip.net/fragment/256mb.png Simulating 1GB ram and after 5881185 ticks and 2116671 4k, 1645957 8k, 1176994 16k, 705873 32k and 235690 64k allocations you get this: http://mrvn.homeip.net/fragment/1gb.png MfG Goswin
Re: [00/41] Large Blocksize Support V7 (adds memmap support)
Nick Piggin <[EMAIL PROTECTED]> writes: > On Tuesday 11 September 2007 22:12, Jörn Engel wrote: >> On Tue, 11 September 2007 04:52:19 +1000, Nick Piggin wrote: >> > On Tuesday 11 September 2007 16:03, Christoph Lameter wrote: >> > > 5. VM scalability >> > >Large block sizes mean less state keeping for the information being >> > >transferred. For a 1TB file one needs to handle 256 million page >> > >structs in the VM if one uses 4k page size. A 64k page size reduces >> > >that amount to 16 million. If the limitations in existing filesystems >> > >are removed then even higher reductions become possible. For very >> > >large files like that a page size of 2 MB may be beneficial which >> > >will reduce the number of page structs to handle to 512k. The >> > > variable nature of the block size means that the size can be tuned at >> > > file system creation time for the anticipated needs on a volume. >> > >> > The idea that there even _is_ a bug to fail when higher order pages >> > cannot be allocated was also brushed aside by some people at the >> > vm/fs summit. I don't know if those people had gone through the >> > math about this, but it goes somewhat like this: if you use a 64K >> > page size, you can "run out of memory" with 93% of your pages free. >> > If you use a 2MB page size, you can fail with 99.8% of your pages >> > still free. That's 64GB of memory used on a 32TB Altix. >> >> While I agree with your concern, those numbers are quite silly. The > > They are the theoretical worst case. Obviously with a non trivially > sized system and non-DoS workload, they will not be reached. I would think it should be pretty hard to have only one page out of each 2MB chunk allocated and non-evictable (writeable, swappable or movable). Wouldn't that require some kernel driver to allocate all pages and then selectively free them in such a pattern as to keep one page per 2MB chunk? 
Assuming nothing tries to allocate a large chunk of ram while holding too many locks for the kernel to free it. >> chances of 99.8% of pages being free and the remaining 0.2% being >> perfectly spread across all 2MB large_pages are lower than those of SHA1 >> creating a collision. I don't see anyone abandoning git or rsync, so >> your extreme example clearly is the wrong one. >> >> Again, I agree with your concern, even though your example makes it look >> silly. > > It is not simply a question of once-off chance for an all-at-once layout > to fail in this way. Fragmentation slowly builds over time, and especially > if you do actually use higher-order pages for a significant number of > things (unlike we do today), then the problem will become worse. If you > have any part of your workload that is affected by fragmentation, then > it will cause unfragmented regions to eventually be used for fragmentation > inducing allocations (by definition -- if it did not, eg. then there would be > no fragmentation problem and no need for Mel's patches). It might be naive (stop me as soon as I go into dream world) but I would think there are two kinds of fragmentation: Hard fragments - physical pages the kernel can't move around Soft fragments - virtual pages/cache that happen to cause a fragment I would further assume most ram is used on soft fragments and that the kernel will free them up by flushing or swapping the data when there is sufficient need. With defragmentation support the kernel could prevent some flushing or swapping by moving the data from one physical page to another. But that would just reduce unnecessary work and not change the availability of larger pages. Further I would assume that there are two kinds of hard fragments: Fragments allocated once at start time and temporary fragments. At boot time (or when a module is loaded or something) you get a tiny amount of ram allocated that will remain busy basically forever. 
You get some fragmentation right there that you can never get rid of. At runtime a lot of pages are allocated and quickly freed again. They preferably get placed in regions where there already is fragmentation, in regions where there are suitably sized holes already. They would only break a free 2MB chunk into smaller chunks if there is no small hole to be found. Now a trick I would use is to put kernel allocated pages at one end of the ram and virtual/cache pages at the other end. Small kernel allocs would find holes at the start of the ram while big allocs would have to move more to the middle or end of the ram to find a large enough hole. And virtual/cache pages could always be cleared out to free large continuous chunks. Splitting the two types would prevent fragmentation of freeable and not freeable regions, always giving us a large pool to pull compound pages from. One could also split the ram into regions of different page sizes, meaning that some large compound pages may not be split below a certain limit. E.g. some amount of ram would be reserved for chunks >=64k only.
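The two-ended placement trick could be sketched as follows (purely my toy illustration of the proposal, not an existing kernel policy):

```python
# Unmovable (kernel) allocations grow from the low end of ram, movable
# (virtual/cache) pages from the high end; the free gap stays contiguous
# in the middle, so large chunks come from there.
class TwoEnded:
    def __init__(self, pages):
        self.low = 0       # next free page from the bottom
        self.high = pages  # one past the last free page from the top

    def alloc(self, pages, movable):
        if self.high - self.low < pages:
            return None  # not enough contiguous space left
        if movable:
            self.high -= pages
            return self.high
        addr = self.low
        self.low += pages
        return addr
```

Freeing movable pages shrinks only the high end, so the low (unmovable) end never fragments the region that large movable allocations draw from; this is the "splitting the two types" idea in miniature.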
Re: patch: improve generic_file_buffered_write() (2nd try 1/2)
Nick Piggin <[EMAIL PROTECTED]> writes: > Lustre should probably have to be ported over to write_begin/write_end in > order to use it too. With the patches in -mm, if a filesystem is still using > prepare_write/commit_write, the vm reverts to a safe path which avoids > the deadlock (and allows multi-seg io copies), but copies the data twice. Not quite relevant for the performance problem. The situation is like this: lustre servers <-lustre network protocol-> lustre client <-NFS-> desktop The NFSd problem is on the lustre client that only plays gateway. That is not to say that the lustre servers or desktop lose performance due to fragmenting writes too but it isn't that noticeable there. > OTOH, this is very likely to go upstream, so your filesystem will need to be > ported over sooner or later anyway. Lustre copies the ext3 source from the kernel, patches in some extra features and renames them during build. So on the one hand it always breaks whenever someone meddles with the ext3 code. On the other hand improvements for ext3 get picked up by lustre semi-automatically. In this case lustre would get the begin_write() function from ext3 and use it. MfG Goswin
Re: patch: improve generic_file_buffered_write() (2nd try 1/2)
Nick Piggin <[EMAIL PROTECTED]> writes: > On Saturday 08 September 2007 06:01, Goswin von Brederlow wrote: >> Nick Piggin <[EMAIL PROTECTED]> writes: >> > So I believe the problem is that for a multi-segment iovec, we currently >> > prepare_write/commit_write once for each segment, right? We do this >> >> It is more complex. >> >> Currently a __grab_cache_page, a_ops->prepare_write, >> filemap_copy_from_user[_iovec] and a_ops->commit_write is done >> whenever we hit >> >> a) a page boundary > > This is required by the prepare_write/commit_write API. The write_begin > / write_end API is also a page-based one, but in future, we are looking > at having a more general API but we haven't completely decided on the > form yet. "perform_write" is one proposal you can look for. > >> b) a segment boundary > > This is done, as I said, because of the deadlock issue. While the issue is > more completely fixed in -mm, a special case for kernel memory (eg. nfsd) > is in the latest mainline kernels. Can you tell me where to get the fix from -mm? If it is completely fixed there then that could make our patch obsolete. >> Those two cases don't have to, and from the stats basically never, >> coincide. For NFSd this means we do this TWICE per segment and TWICE >> per page. > > The page boundary doesn't matter so much (well it does for other reasons, > but we've never been good at them...). The segment boundary means that > we aren't able to do block sized writes very well and end up doing a lot of > read-modify-write operations that could be avoided. Those are extremely costly for lustre. We have tested exporting a lustre filesystem to NFS. Without fixes we get 40MB/s and with the fixes it rises to nearly 200MB/s. That is a factor of 5 in speed. 
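The "TWICE per segment and TWICE per page" effect is easy to see by counting prepare/commit cycles; here is an illustrative model of mine (the real logic lives in generic_file_buffered_write in mm/filemap.c, this just counts the boundary cuts described above):

```python
# One prepare_write/commit_write cycle per chunk, where a chunk ends at
# whichever comes first: a page boundary or a segment boundary.
PAGE_SIZE = 4096

def count_cycles(segment_lengths, offset=0):
    """Number of prepare/commit cycles for an iovec of the given lengths."""
    cycles = 0
    pos = offset
    for seg in segment_lengths:
        remaining = seg
        while remaining:
            room_in_page = PAGE_SIZE - (pos % PAGE_SIZE)
            step = min(remaining, room_in_page)
            cycles += 1  # one prepare_write/commit_write per chunk
            pos += step
            remaining -= step
    return cycles
```

Two 3000-byte segments span only two pages, yet cutting at both boundary kinds yields three cycles instead of the two that page-sized cutting alone would need; with many small segments the overhead multiplies accordingly.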
>> > because there is a nasty deadlock in the VM (copy_from_user being >> > called with a page locked), and copying multiple segs dramatically >> > increases the chances that one of these copies will cause a page fault >> > and thus potentially deadlock. >> >> What actually locks the page? Is it __grab_cache_page or >> a_ops->prepare_write? > > prepare_write must be given a locked page. Then that means __grab_cache_page does return a locked page because there is nothing between the two calls that would. >> Note that the patch does not change the number of copy_from_user calls >> being made nor does it change their arguments. If we need 2 (or more) >> segments to fill a page we still do 2 separate calls to >> filemap_copy_from_user_iovec, both only spanning (part of) one >> segment. >> >> What the patch changes is the number of copy_from_user calls between >> __grab_cache_page and a_ops->commit_write. > > So you're doing all copy_from_user calls within a prepare_write? Then > you're increasing the chances of deadlock. If not, then you're breaking > the API contract. Actually due to a bug, as you noticed, we do the copy first and then prepare/write. But fixing that would indeed do multiple copies between prepare and commit. >> Copying a full PAGE_SIZE bytes from multiple segments in one go would >> be a further improvement if that is possible. >> >> > The fix you have I don't think can work because a filesystem must be >> > notified of the modification _before_ it has happened. (If I understand >> > correctly, you are skipping the prepare_write potentially until after >> > some data is copied?). >> >> Yes. We changed the order of copy_from_user calls and >> a_ops->prepare_write by mistake. We will rectify that and do the >> prepare_write for the full page (when possible) before copying the >> data into the page. > > OK, that is what used to be done, but the API is broken due to this > deadlock. write_begin/write_end fixes it properly. 
I'm very interested in that fix. >> > Anyway, there are fixes for this deadlock in Andrew's -mm tree, but >> > also a workaround for the NFSD problem in git commit 29dbb3fc. Did >> > you try a later kernel to see if it is fixed there? >> >> Later than 2.6.23-rc5? > > No it would be included earlier. The "segment_eq" check should be > allowing kernel writes (nfsd) to write multiple segments. If you have a > patch which changes this significantly, then it would indicate the > existing logic has a problem (or you've got a userspace application doing > the writev, which should be fixed by the write_begin patches in -mm). I've got a userspace application doing the writev. To be exact, 14% of the commits were saved by combinin
Re: patch: improve generic_file_buffered_write() (2nd try 1/2)
Nick Piggin <[EMAIL PROTECTED]> writes: > Anyway, there are fixes for this deadlock in Andrew's -mm tree, but > also a workaround for the NFSD problem in git commit 29dbb3fc. Did > you try a later kernel to see if it is fixed there? I had a chance to look up that commit (git clone took a while so sorry for writing 2 mails). It is present in 2.6.23-rc5 so I already noticed it when merging our patch in 2.6.23-rc5. Upon closer reading of the patch though I see that it will indeed prevent writes by the nfsd from being split smaller than PAGE_SIZE and it will cause filemap_copy_from_user[_iovec] to be called with a source spanning multiple pages. So the commit 29dbb3fc should have a similar, even slightly better, gain for the nfsd and other kernel space segments. But it will not improve writes from user space, where ~14% of the commits were saved during a day's work for me. Now I have a question about fault_in_pages_readable(). Can I call that for multiple pages and then call __grab_cache_page() without risking one of the pages from getting lost again and causing a deadlock? MfG Goswin
Re: patch: improve generic_file_buffered_write() (2nd try 1/2)
Nick Piggin <[EMAIL PROTECTED]> writes: > On Thursday 06 September 2007 03:41, Bernd Schubert wrote: > Minor nit: when resubmitting a patch, you should include everything > (ie. the full changelog of problem statement and fix description) in a > single mail. It's just a bit easier... Will do next time. > So I believe the problem is that for a multi-segment iovec, we currently > prepare_write/commit_write once for each segment, right? We do this It is more complex. Currently a __grab_cache_page, a_ops->prepare_write, filemap_copy_from_user[_iovec] and a_ops->commit_write are done whenever we hit a) a page boundary or b) a segment boundary. Those two cases don't have to, and from the stats basically never, coincide. For NFSd this means we do this TWICE per segment and TWICE per page. > because there is a nasty deadlock in the VM (copy_from_user being > called with a page locked), and copying multiple segs dramatically > increases the chances that one of these copies will cause a page fault > and thus potentially deadlock. What actually locks the page? Is it __grab_cache_page or a_ops->prepare_write? Note that the patch does not change the number of copy_from_user calls being made, nor does it change their arguments. If we need 2 (or more) segments to fill a page, we still do 2 separate calls to filemap_copy_from_user_iovec, each only spanning (part of) one segment. What the patch changes is the number of copy_from_user calls between __grab_cache_page and a_ops->commit_write. Copying a full PAGE_SIZE bytes from multiple segments in one go would be a further improvement, if that is possible. > The fix you have I don't think can work because a filesystem must be > notified of the modification _before_ it has happened. (If I understand > correctly, you are skipping the prepare_write potentially until after > some data is copied?). Yes. We changed the order of copy_from_user calls and a_ops->prepare_write by mistake. 
We will rectify that and do the prepare_write for the full page (when possible) before copying the data into the page. > Anyway, there are fixes for this deadlock in Andrew's -mm tree, but > also a workaround for the NFSD problem in git commit 29dbb3fc. Did > you try a later kernel to see if it is fixed there? Later than 2.6.23-rc5? > Thanks, > Nick MfG Goswin
Re: [fuse-devel] FS block count, size and seek offset?
"David Brown" <[EMAIL PROTECTED]> writes: >> Why don't you use the existing fuse-unionfs? > > I thought about doing this but it would need to be modified somehow > and even then my users would look to me to fix issues and I don't like > trying to find hard bugs in other peoples code. > > Also, there's a lot of functionality that funionfs has but I don't > need and the extra code would get in the way attempting to modify or > debug issues. > > What I want is fairly specific and I've not seen anything out there to do it. > > Thanks, > - David Brown You can still read their code to see how they solved problems you have. MfG Goswin
Re: [fuse-devel] FS block count, size and seek offset?
"David Brown" <[EMAIL PROTECTED]> writes: > I was looking at various file systems and how they return > stat.st_blocks and stat.st_size for directories and had some questions > on how a fuse filesystem is supposed to implement readdir with the > seek offset when trying to union two directories together. > > Is the offset a byte offset into the DIR *dp? or is it the struct > dirent size (which is relative based on the name of the file) into the > dir pointer? I think it is totally your call what you store in it. The only requirement is that you never pass 0 to the filler function. off_t is 64-bit, so storing a DIR* in it should be no problem. Or a pointer to struct UnionDIR { DIR *dp; struct UnionDIR *next; } > Also, if you want to be accurate when you stat a directory that's > unioned in the fuse file system how many blocks should one return? > Since each filesystem seems to return different values for size and > number of blocks for directories. I know I could just say that its not > supported with my filesystem built using fuse... but I'd like to at > least try to be accurate. You could add them up and round to some common block size (if they differ). But I don't think it matters, and nothing uses that info. What is more important is the link count. For example, find uses the link count to know how many subdirs a directory has. Once it has found that many, it assumes there are no more dirs and saves on stat() calls. > Is it accurate to assume that the size or number of blocks returned > from a stat will be used to pass a seek offset? > > When does fuse use the seek offset? Afaik never. The offset is only stored for the next readdir call but never used inside fuse. > These are the number of blocks and size on an empty dir. > ext3 > size 4096 nblocks 8 > reiserfs > size 48 nblocks 0 > jfs > size 1 nblocks 0 > xfs > size 6 nblocks 0 > > Any help to figure out how to union two directories and return correct > values would be helpful. Why don't you use the existing fuse-unionfs? 
> Thanks, > - David Brown > > P.S. maybe a posix filesystem interface manual would be good? MfG Goswin