Re: [PATCH] Remove process freezer from suspend to RAM pathway

Kyle Moffett Mon, 09 Jul 2007 20:05:30 -0700

Thanks for the detailed reply!

On Jul 09, 2007, at 22:07:15, Nigel Cunningham wrote:

On Friday 06 July 2007 15:01:48 Kyle Moffett wrote:
Suppose hibernate is implemented like this:
(1) Userspace program calls sys_freeze_processes()
   (a) Pokes all CPUs with IPMIs and tells them to finish the currently running timeslot then stop
   (b) Atomically sends SIGSTOP to all userspace processes in a non-trappable way, except the calling process and any process which is ptracing it.
   (c) Returns to the calling process.
Ok. First, I'll ignore the specification that userspace does this - I don't think it matters whether it's userspace or kernel that does the suspending and I'm yet to see a good reason for it to be [required to be] done from userspace.

The reason it's _required_ to be done from userspace is that userspace is the only one which can figure out "These processes need to run for suspend to work", and then let those processes continue running after the freeze. The *ONLY* reason this even stops processes at all is so we can do the post-device-mapper-snapshot code with very little usably-free RAM (IE: only about 1MB for a standard desktop system).

In this first step, you've reinvented the first part of the current freezer implementation. The reason we don't use a real signal is precisely so we can have an untrappable SIGSTOP. In this regard, I particularly remember Win4Lin from a few years ago. It would die if you sent it a real signal, so we had to do it this way. No doubt there are other instances I'm not aware of.

Well, you *do* want it to have semi-signal semantics, processes which receive it must not get back to userspace code so that they don't start allocating more memory when we're trying to do the freeze. You also don't want a process to be able to trap it (IE: like SIGSTOP or SIGKILL).

On the other hand, it should be delivered asynchronously (IE: It doesn't break an interruptible sleep or respond to most is-a-signal-present checks). You don't actually care if it's sleeping in the kernel somewhere, just as long as it doesn't allocate much memory.

You would probably need a new signal "SIGFREEZE" which causes the process to be ignored as runnable the next time it schedules but never actually gets delivered, and a "SIGUNFREEZE" which does the reverse. That way userspace could selectively resume processes based on its policy of "this needs to run for hibernation".
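
A rough Python model of the freeze/unfreeze semantics described above. This is only a sketch: SIGFREEZE and SIGUNFREEZE do not exist, and the `Task` and `Scheduler` names are hypothetical, made up for illustration. It tries to show the request taking effect at the next schedule rather than being delivered to the process, and userspace selectively unfreezing tasks according to its own policy.

```python
from dataclasses import dataclass


@dataclass
class Task:
    """Hypothetical stand-in for a process; not a real kernel structure."""
    name: str
    freeze_pending: bool = False  # set by the imaginary SIGFREEZE
    frozen: bool = False          # task is skipped by the scheduler


class Scheduler:
    """Toy scheduler illustrating the proposed SIGFREEZE/SIGUNFREEZE idea."""

    def __init__(self, tasks):
        self.tasks = {t.name: t for t in tasks}

    def freeze_all(self, caller):
        # Mark every task except the caller. Nothing is delivered to the
        # task itself, so it cannot trap or observe the request.
        for task in self.tasks.values():
            if task.name != caller:
                task.freeze_pending = True

    def unfreeze(self, name):
        # The imaginary SIGUNFREEZE: make the task runnable again.
        task = self.tasks[name]
        task.freeze_pending = False
        task.frozen = False

    def schedule(self):
        # The pending flag takes effect only here, at the next schedule,
        # so a sleep inside the kernel is never interrupted.
        runnable = []
        for task in self.tasks.values():
            if task.freeze_pending:
                task.frozen = True
            if not task.frozen:
                runnable.append(task.name)
        return runnable


sched = Scheduler([Task("hibernate"), Task("nfsd"), Task("firefox")])
sched.freeze_all(caller="hibernate")
after_freeze = sched.schedule()
sched.unfreeze("nfsd")  # policy decision made by userspace
after_unfreeze = sched.schedule()
```

In this model only the caller is runnable after the freeze, and "nfsd" becomes runnable again once userspace decides it is needed.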

(2) Userspace process sends SIGCONT to only those processes which are necessary for sync and a device-mapper snapshot.
How do you determine which ones are needed?

It's userspace's job to know which ones are needed. For example, if you are hibernating over NFS then you need to resume the various NFS/RPC daemons and threads.

Why stop them in the first place?

So they aren't allocating memory when we are doing the device-mapper snapshot.

(3) Userspace calls sys_snapshot_kernel(snapshot_overhead_pages)
   (a) Kernel starts freeing memory and swapping stuff out to make room for a copy of *kernel* memory (not pagecache, not process RAM). It does the same for at least snapshot_overhead_pages extra (used by userspace later). It then allocates this memory to keep it from going away. Since most processes are stopped we won't have much else competing with us for the RAM.
Ok. So now you also need processes running that are needed for swapping, because freeing that memory might involve swapping. Fully agree with the logic though (not really surprising - this is what I do in Suspend2^wTuxOnIce).
   (a) Kernel uses the device-mapper up-call-into-filesystem machinery to get all mounted filesystems synced and ready for a DM snapshot. This may include sending data via the userspace processes resumed in (2). Any deadlocks here are userspace's fault (see (2)). Will need some modification to handle doing multiple blockdevs at a time. Anything using FUSE is basically perma-synced anyways (no dep-handling needed), and anything using loop should already be handled by DM. This includes allocating memory for the basic snapshot datastructures.
   (b) At this point all blockdev operations should be halted and disk caches flushed; that's all we care about.
   (c) Go through the device tree and quiesce DMA and shut off interrupts. Since all the disks are synced this is easy.
   (d) Use IPMIs again to get all the CPUs together, which should be easy as most processes are sleeping in IO or SIGSTOPed, and we're getting no interrupts.
   (e) One CPU turns off all interrupts on itself and takes an atomic snapshot of kernel memory into the previously allocated storage. Once again, does not include pagecache. The kernel also records a list of what pages *are* included in the pagecache. It then marks all userspace pages as copy-on-write.
Hotplugging cpus (when all those locking issues are taken care of) is simpler. Prior to cpu hotplugging, I used IMPIs to put secondary cpus into a tight loop, so I know it's possible to do it this way too.

It may be simpler, but it really screws up things like cpusets, processor affinity, etc. It also ties hibernation to the presently very-flakey CPU-hotplug support, which is probably not what we want.

That way, though, you have less flexibility. What if a cpu really is plugged in between hibernate and resume? With cpu hotplugging, it's handled properly and transparently. Without cpu hotplugging, you could be using uninitialised data after the atomic restore.

IMHO if the user pulls a CPU while the box is hibernated, then he/she gets what he/she deserves. If you really want to support that, then the user must do the hotplug operation *manually* before suspending. Anything else is just going to be shooting ourselves in the foot repeatedly.

Marking userspace as COW makes things more complicated, too. You then have to add code to the COW handling to update the list of pages that need to be saved, and you reduce the reliability of the whole process. You can't predict beforehand how many of these COW pages are going to be needed, and therefore can't know how much memory to free earlier on in the process. If you run out of memory, what will be the effect?

You could pretty easily have a spare 128MB swap partition somewhere which is not used during system operation but is "swapon"ed by userspace after the COW snapshot to provide extra backing store.

   (f) That CPU finalizes the modified DM snapshot using the previously-allocated memory.
   (g) That CPU frees up the snapshot_overhead_pages memory allocated during step (a) for userspace to use.
   (h) The CPU does the equivalent of a "swapoff -a" without overwriting any data already on any swap device(s).
You still need to remember what swap you're going to use to write the image. You'll probably want to get this information (and allocate the swap) sooner rather than later so that you're not racing against the memory freeing earlier, and don't run into issues with bmapping the pages or having enough memory to record the bdevs & sector numbers (not usually an issue, but if swap is highly fragmented...).

Who says we have to use this swap to write the image? That may be the default use-case, but it certainly shouldn't be mandatory. Really, for the write-image-to-swap case you would just need to preallocate sufficient memory for the bmap tables beforehand, then populate them at this phase.

   (i) The CPU then IPMI-signals the other CPUs to wake them up
   (j) The kernel returns a FD-reference to the snapshot and the read-only halves of the CoW pagecache to the process which called sys_snapshot_kernel().
Readonly halves? I don't get that, sorry.

Well, each page is copy-on-write, so the FD reference would always provide access to the original page data, whereas the processes may end up copying the page so they can write to it. The trick would be that shared pages need to remain shared between processes even after the copy-on-write. This is likely to be the trickiest part.
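
To make the "read-only halves" idea a bit more concrete, here is a toy Python model. It is only a sketch under my own assumptions: `CowMemory` and its methods are hypothetical names, not kernel interfaces, and pages are just dictionary entries. The snapshot reader always sees the data as of the snapshot, while writers see the modified page, and every sharer of a page number sees the same modified copy.

```python
class CowMemory:
    """Toy model: a snapshot keeps the original page data while writers
    get the modified copy. Hypothetical names, not kernel interfaces."""

    def __init__(self, pages):
        self.live = dict(pages)  # page number -> current contents
        self.snapshot = None     # preserved originals, filled in lazily
        self.copied = set()      # pages that have been CoW-ed so far

    def take_snapshot(self):
        # Nothing is copied yet; every page is merely treated as
        # copy-on-write from this point on.
        self.snapshot = {}
        self.copied = set()

    def write(self, page, data):
        # The first write after the snapshot preserves the original.
        if self.snapshot is not None and page not in self.copied:
            self.snapshot[page] = self.live[page]
            self.copied.add(page)
        # Sharers refer to the same page number, so they all see this.
        self.live[page] = data

    def read_snapshot(self, page):
        # The snapshot reader always sees data as of take_snapshot().
        if page in self.copied:
            return self.snapshot[page]
        return self.live[page]


mem = CowMemory({0: b"untouched", 1: b"shared"})
mem.take_snapshot()
mem.write(1, b"changed")
mem.write(1, b"changed again")  # second write needs no further copy
```

Note that the number of entries in `copied` is exactly the quantity that cannot be predicted in advance, which is the concern raised earlier in the thread.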

(4) The userspace process now has a reference to the copy of the kernel pages and the unmodified pagecache pages. Since 99% of the processes aren't running, we aren't going to be having to CoW many of the pagecache pages.
Mmm, but you still don't know how many.

Yes, but at this point we're basically running a segment of userspace with full kernel services available. Like I said above we can just add a dedicated 128MB swap device to provide some spare backing store. When you start running low on memory it might even page out to the swap device a part of the atomic copy of kernel memory.

(5) The userspace process uses read() or other syscalls to get data out of the kernel-snapshot FD in small chunks, within its snapshot_overhead_pages limit. It compresses these and writes them out to the snapshot-storage blockdev (must not be mounted during snapshot), or to any network server.
(6) The userspace process syncs the disks and halts the system. Any changed filesystem pages after the pseudo-DM-snapshot should have been stored in semi-volatile storage somewhere and will be discarded on the next reboot.
Are you thinking the changed filesystem pages are caught by COW? (AFAIUI, kernel writes aren't). If (as I expect), you're thinking about filesystem writes to DM based storage, what about non DM-based filesystem pages?
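
Step (5) above could look roughly like the following from the userspace side. This is a sketch with assumptions clearly marked: the kernel-snapshot FD does not exist, so an in-memory `io.BytesIO` stands in for it and for the storage blockdev, and `CHUNK_SIZE` is a made-up value standing in for whatever fits within snapshot_overhead_pages. Only the streaming pattern (read a small chunk, compress it, write it out) is the point.

```python
import io
import zlib

# Hypothetical value; stands in for the snapshot_overhead_pages budget.
CHUNK_SIZE = 64 * 1024


def stream_snapshot(snapshot_fd, image_out, chunk_size=CHUNK_SIZE):
    """Copy snapshot_fd to image_out, compressing one chunk at a time."""
    compressor = zlib.compressobj()
    total_in = 0
    while True:
        chunk = snapshot_fd.read(chunk_size)
        if not chunk:
            break
        total_in += len(chunk)
        # Only one chunk plus the compressor state is held in memory.
        image_out.write(compressor.compress(chunk))
    # Flush whatever the compressor is still buffering.
    image_out.write(compressor.flush())
    return total_in


# In-memory stand-ins for the snapshot FD and the storage blockdev.
fake_snapshot = io.BytesIO(b"kernel page data " * 50000)
fake_blockdev = io.BytesIO()
bytes_read = stream_snapshot(fake_snapshot, fake_blockdev)
```

The same loop would work unchanged with a socket file object as `image_out` for the write-to-network-server case.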

No, but a kernel write to a DM-snapshot calls the DM-snapshot code which copies the segment(s) to the snapshot device and modifies them there. Basically disk-filesystem pagecache pages would be synced and protected by the DM snapshot, while anonymous memory pages would be CoW-ed. And it might not be an unreasonable requirement to state that the disk-based filesystems must all be mapped through DM-snapshot devices (even just straight 1:1 linear mappings), so that they can be trivially snapshotted.
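
A toy Python model of the block-level behaviour I have in mind, purely as a sketch. `RedirectingSnapshot` is a hypothetical name and this is a model of the scheme described in this mail, not of how the real dm-snapshot target is implemented. Writes made after the snapshot land in a scratch store, the origin stays as it was at snapshot time, and the scratch store is thrown away on the next boot.

```python
class RedirectingSnapshot:
    """Toy block device: post-snapshot writes land in a scratch store.

    Hypothetical model of the scheme described above, not the actual
    dm-snapshot implementation.
    """

    def __init__(self, origin_blocks):
        self.origin = list(origin_blocks)  # on-disk state at snapshot time
        self.scratch = {}                  # block number -> modified data

    def write(self, block, data):
        # The origin is never touched after the snapshot is taken.
        self.scratch[block] = data

    def read(self, block):
        # The running system sees its own writes, else the origin data.
        return self.scratch.get(block, self.origin[block])

    def discard(self):
        # On the next boot the scratch store is simply thrown away.
        self.scratch.clear()


dev = RedirectingSnapshot([b"superblock", b"inode table", b"data"])
dev.write(2, b"written during hibernate")
seen_while_running = dev.read(2)
dev.discard()
seen_after_reboot = dev.read(2)
```

With a straight 1:1 linear mapping underneath, the filesystem would not notice the difference while running, and the on-disk state stays consistent with the saved image.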

So basically your hibernate-overhead would consist of:
   (1) The pages necessary for the atomic snapshot of kernel memory and the list of pagecache pages at that time
   (2) A little memory necessary for the kernel non-persistent DM snapshot datastructures.
   (3) The snapshot_overhead_pages needed by userspace.
If you're using swap devices then you can save 99% of the state of the running kernel with an initial swapout overhead of virtually nothing beyond the size of the unswappable kernel memory.
FWIW, let me note an important variation from how Suspend2 works; it might provide food for thought. In Suspend2, we treat the processes that remain stopped throughout the whole process specially. We write their data to disk before the atomic copy (usually 70 or 80% of memory), and then use the memory they occupy for the destination of the atomic copy. This further reduces the amount of memory that has to be freed, almost always to zero.
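
The three overhead components listed above can be added up with a small Python sketch. Everything numeric here is an assumption for illustration only: the 4096-byte page size, the 8 bytes per pagecache list entry, and the example page counts are made up, and `hibernate_overhead_pages` is a hypothetical helper, not an existing interface.

```python
PAGE_SIZE = 4096  # bytes; assumed, typical for x86


def hibernate_overhead_pages(kernel_pages, pagecache_pages,
                             dm_snapshot_pages, snapshot_overhead_pages,
                             list_entry_bytes=8):
    """Pages that must be free before the atomic snapshot (toy estimate)."""
    # (1) a copy of kernel memory plus the list of pagecache pages
    list_bytes = pagecache_pages * list_entry_bytes
    list_pages = -(-list_bytes // PAGE_SIZE)  # ceiling division
    # (2) DM snapshot datastructures and (3) the userspace working set
    return (kernel_pages + list_pages
            + dm_snapshot_pages + snapshot_overhead_pages)


# Made-up example figures, only to exercise the arithmetic.
example = hibernate_overhead_pages(
    kernel_pages=20000,
    pagecache_pages=100000,
    dm_snapshot_pages=64,
    snapshot_overhead_pages=256,
)
```

The kernel-memory term dominates in this example, which matches the claim that the overhead is little more than the size of the unswappable kernel memory.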

I suppose you could record swap-outs done to SIGFREEZEd processes specially, so that they would be swapped in again before resuming userspace. That would effectively result in the same thing.


Cheers,
Kyle Moffett


