Re: CRAK: a process checkpoint/restart kernel module
On Fri, 25 May 2001, Pavel Machek wrote:

> > Basically I took three steps to migrate a TCP socket. Assuming A and B
> > are the two peers:
> >
> > 1. shut down process A while keeping B open
> > 2. restart A and re-establish the socket which points to B
> > 3. change the socket on B to point to the new location of A
>
> This assumes both A and B are on same machine, right?

No. They can be on different machines. That's why it's called
"migration" :-)

> > The problem is, during this stage, if B sends packets to A before 3 is
> > complete, B's socket will get a RST. In the case of X, if you click or
> > move cursor on A's window when A is being migrated, it will crash.
>
> <EVIL SOLUTION>
> You might shutdown machine's networking between checkpoint and
> restart. That way, packets are silently lost, and there's no RST to be
> generated.
> </EVIL>

That's what a virtual network interface could be used for. Packets sent to
A can be queued or discarded, whatever, if we have control at the
interface level. Actually one PhD student in my department has been
working on it, and CRAK is just part of the project.

> I guess you can't checkpoint/restart when there's remote machine
> involved. I was not thinking online games, I was thinking about
> tuxracer (game on localhost).

localhost is much easier, but the same problem still exists.

> --
> I'm [EMAIL PROTECTED] "In my country we have almost anarchy and I don't care."
> Panos Katsaloulis describing me w.r.t. patents at [EMAIL PROTECTED]

Hua Zhong
Central Research Facilities
Department of Computer Science
Columbia University
New York, NY 10027
Email: [EMAIL PROTECTED]
http://www.cs.columbia.edu/~huaz

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
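The queue-or-discard idea in the message above can be sketched in user
space. This is only a toy model, not CRAK code and not the virtual network
interface project mentioned there: the VirtualInterface class, its policy
argument, and the deliver callbacks are hypothetical names made up for
illustration. It holds (or drops) packets addressed to A while A is being
migrated, so nothing reaches a host that would answer with a RST, and then
flushes the held packets to A's new location.

```python
from collections import deque


class VirtualInterface:
    """Hypothetical shim sitting in front of process A's socket."""

    def __init__(self, deliver, policy="queue"):
        if policy not in ("queue", "discard"):
            raise ValueError("policy must be 'queue' or 'discard'")
        self._deliver = deliver      # callable that hands a packet to A
        self._policy = policy
        self._migrating = False
        self._held = deque()         # packets parked during migration
        self.dropped = 0             # packets silently lost during migration

    def begin_migration(self):
        self._migrating = True

    def end_migration(self, deliver=None):
        """Resume delivery, optionally to A's new location, and flush."""
        if deliver is not None:
            self._deliver = deliver
        self._migrating = False
        while self._held:
            self._deliver(self._held.popleft())

    def receive(self, packet):
        if not self._migrating:
            self._deliver(packet)
        elif self._policy == "queue":
            self._held.append(packet)
        else:
            self.dropped += 1        # no RST goes back to B


def demo():
    old_host, new_host = [], []
    vif = VirtualInterface(old_host.append, policy="queue")
    vif.receive(b"click-1")          # delivered to A at its old location
    vif.begin_migration()
    vif.receive(b"click-2")          # held while A is in flight
    vif.receive(b"click-3")
    vif.end_migration(new_host.append)
    vif.receive(b"click-4")          # delivered to A at its new location
    return old_host, new_host
```

With policy="discard" the packets are counted and dropped instead, which is
closer to the shut-down-networking suggestion; B's TCP would then normally
retransmit whatever was not acknowledged.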
Re: CRAK: a process checkpoint/restart kernel module
Hi!

> Please cc to me - I am currently off the list.

Ok.

> > One question: can crak be used for process migration (assuming nodes
> > share filesystem)? (As in, node of
> > cluster is going down so we checkpoint and resume on some other node?)
>
> Yes, as long as the resources (opened files) can be accessed on both
> nodes.

Good.

> > PS: Can it checkpoint/restart X applications? I guess some games would
> > be easier with ability to checkpoint ;-)
>
> Which means we need to support migrating network sockets. I added
> TCP/IPv4 socket support this spring (currently for 2.2.19 and will port to
> 2.4 shortly), and I tested migrating X. In certain cases I
> successfully migrated some applications like Emacs, Acroread, etc, but
> there is a problem. (The socket migration code has not been put online,
> but I'd like to discuss how it works here)
>
> Basically I took three steps to migrate a TCP socket. Assuming A and B
> are the two peers:
>
> 1. shut down process A while keeping B open
> 2. restart A and re-establish the socket which points to B
> 3. change the socket on B to point to the new location of A

This assumes both A and B are on same machine, right?

> The problem is, during this stage, if B sends packets to A before 3 is
> complete, B's socket will get a RST. In the case of X, if you click or
> move cursor on A's window when A is being migrated, it will crash.

<EVIL SOLUTION>
You might shutdown machine's networking between checkpoint and
restart. That way, packets are silently lost, and there's no RST to be
generated.
</EVIL>

> One solution might be freezing B while A is being migrated. There are
> two ways to freeze B:
>
> 1) send a SIGSTOP to B and later SIGCONT it. It's simple to do but would
> result in freezing the whole process, which is bad in certain cases (e.g.,
> the whole X server is stopped - the screen freezes).

Assuming they are on same machine.

> 2) freeze the socket only. I tried to set window sizes of B's socket to
> zero, but it didn't work (I didn't try too hard though). I'd like to know
> whether there is a way to do so.

You don't want to decrease window size, you want all packets silently
discarded.

> Unfortunately, even if we use 1), it still doesn't solve the whole problem.
> For example, when the X connection is tunneled through ssh, you can only
> freeze the sshd process, but packets are still sent to it when you click
> on the server side, which will crash the connection as well (at least for
> my current implementation). One reason might be I didn't take care of
> pending packets when I migrate a socket, but in fact, the real problem of
> socket migration is that you don't know what would happen if the network
> address is changed. Applications may depend on it (such as FTP). A
> virtual network interface should be provided to solve the problem
> gracefully.
>
> As for migrating games, hmmm, here are my 2cents:
>
> 1) Most online games use UDP, and CRAK hasn't implemented UDP support.
> It's much easier than TCP though.

I guess you can't checkpoint/restart when there's remote machine
involved. I was not thinking online games, I was thinking about
tuxracer (game on localhost).

								Pavel
--
I'm [EMAIL PROTECTED] "In my country we have almost anarchy and I don't care."
Panos Katsaloulis describing me w.r.t. patents at [EMAIL PROTECTED]
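Option 1) above, SIGSTOP followed by SIGCONT, can be sketched from user
space as below. This is a minimal illustration, not CRAK's code:
freeze_during and demo are hypothetical names, the sleeping child merely
stands in for peer B, and the migrate_a placeholder does no real
migration. os.kill() only reaches processes on the local machine, which is
the same-machine assumption pointed out above.

```python
import os
import signal
import subprocess
import sys


def freeze_during(pid, critical_section):
    """SIGSTOP process `pid`, run critical_section(), then SIGCONT it."""
    os.kill(pid, signal.SIGSTOP)
    try:
        return critical_section()
    finally:
        # Always resume B, even if the migration step raised.
        os.kill(pid, signal.SIGCONT)


def demo():
    # Stand-in for peer B: a child process that just sleeps.
    b = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(30)"])
    try:
        def migrate_a():
            # Placeholder for "checkpoint A, restart it, repoint B's socket".
            # Here we only confirm that B really is stopped in the meantime.
            _, status = os.waitpid(b.pid, os.WUNTRACED)
            return os.WIFSTOPPED(status)

        return freeze_during(b.pid, migrate_a)
    finally:
        b.kill()
        b.wait()
```

The whole of B is frozen for as long as the critical section runs, which
is exactly the drawback described for the X server case.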
Re: CRAK: a process checkpoint/restart kernel module
Please cc to me - I am currently off the list.

On Wed, 23 May 2001, Pavel Machek wrote:

> Hi!!
>
> One question: can crak be used for process migration (assuming nodes
> share filesystem)? [As in, node of
> cluster is going down so we checkpoint and resume on some other node?]

Yes, as long as the resources (opened files) can be accessed on both
nodes.

> PS: Can it checkpoint/restart X applications? I guess some games would
> be easier with ability to checkpoint ;-)

Which means we need to support migrating network sockets. I added
TCP/IPv4 socket support this spring (currently for 2.2.19 and will port to
2.4 shortly), and I tested migrating X. In certain cases I
successfully migrated some applications like Emacs, Acroread, etc, but
there is a problem. (The socket migration code has not been put online,
but I'd like to discuss how it works here)

Basically I took three steps to migrate a TCP socket. Assuming A and B
are the two peers:

1. shut down process A while keeping B open
2. restart A and re-establish the socket which points to B
3. change the socket on B to point to the new location of A

The problem is, during this stage, if B sends packets to A before 3 is
complete, B's socket will get a RST. In the case of X, if you click or
move cursor on A's window when A is being migrated, it will crash.

One solution might be freezing B while A is being migrated. There are
two ways to freeze B:

1) send a SIGSTOP to B and later SIGCONT it. It's simple to do but would
result in freezing the whole process, which is bad in certain cases (e.g.,
the whole X server is stopped - the screen freezes).

2) freeze the socket only. I tried to set window sizes of B's socket to
zero, but it didn't work (I didn't try too hard though). I'd like to know
whether there is a way to do so.

Unfortunately, even if we use 1), it still doesn't solve the whole problem.
For example, when the X connection is tunneled through ssh, you can only
freeze the sshd process, but packets are still sent to it when you click
on the server side, which will crash the connection as well (at least for
my current implementation). One reason might be I didn't take care of
pending packets when I migrate a socket, but in fact, the real problem of
socket migration is that you don't know what would happen if the network
address is changed. Applications may depend on it (such as FTP). A
virtual network interface should be provided to solve the problem
gracefully.

As for migrating games, hmmm, here are my 2cents:

1) Most online games use UDP, and CRAK hasn't implemented UDP support.
It's much easier than TCP though.

2) I am not sure what the effect would be if we changed the network
address. Most games require you to join a group before you start, and
maybe the group membership is based on network address.

Lastly, there is a lot of work left to do to make process migration work
truly reliably, and CRAK is still far from that. For example, what if an
application depends on its pid? What if a process uses temporary files
(/tmp) which are not present on other nodes? Or what if an application
deletes files that are still opened (evil programs like make)? Not all of
these can be handled, or handled without enough kernel cooperation. That
is particularly hard when CRAK is just a kernel module.

I am still a learner. I wrote CRAK mostly for fun, but I'd like to hear
some advice from the kernel hacker community if people think it has some
value.

> --
> Philips Velo 1: 1"x4"x8", 300gram, 60, 12MB, 40bogomips, linux, mutt,
> details at http://atrey.karlin.mff.cuni.cz/~pavel/velo/index.html.

Hua Zhong
Central Research Facilities
Department of Computer Science
Columbia University
New York, NY 10027
Email: [EMAIL PROTECTED]
http://www.cs.columbia.edu/~huaz
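What B observes when its socket gets a RST can be reproduced on localhost.
The sketch below does not migrate anything and is not CRAK code: it forces
a RST by closing A's end with SO_LINGER set to a zero timeout, as a
stand-in for A's socket disappearing while B still holds its end, and
peer_sees_reset is a hypothetical name. B's next recv() then fails with
ECONNRESET rather than seeing an orderly end of stream.

```python
import socket
import struct


def peer_sees_reset():
    """Return True if B's recv() fails with ECONNRESET after A aborts."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))          # any free port on loopback
    listener.listen(1)

    b = socket.create_connection(listener.getsockname())   # peer B
    a, _ = listener.accept()                                # peer A

    # l_onoff=1, l_linger=0: close() aborts the connection with a RST
    # instead of the usual FIN handshake.
    a.setsockopt(socket.SOL_SOCKET, socket.SO_LINGER, struct.pack("ii", 1, 0))
    a.close()
    listener.close()

    b.settimeout(5)
    try:
        b.recv(1)
    except ConnectionResetError:
        return True
    finally:
        b.close()
    return False
```

An application on B's side that does not expect this error is likely to
treat the connection as dead, which matches the crash described for X.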
Re: CRAK: a process checkpoint/restart kernel module
Hi!!

> This project has been there for over one year, and I've got quite a few
> emails asking about it. Before it becomes more reliable, I think letting
> more people know about it is a good idea. Thanks to those who ever
> pushed me on it :-)
>
> I guess many of you already know about epckpt, a patch written
> by Eduardo Pinheiro that adds process checkpoint/restart capability to the
> Linux kernel. CRAK does a similar thing - in fact, I started this
> project based on epckpt's code, but by now the two are very different.

One question: can crak be used for process migration (assuming nodes
share filesystem)? [As in, node of
cluster is going down so we checkpoint and resume on some other node?]

								Pavel
PS: Can it checkpoint/restart X applications? I guess some games would
be easier with ability to checkpoint ;-)
--
Philips Velo 1: 1"x4"x8", 300gram, 60, 12MB, 40bogomips, linux, mutt,
details at http://atrey.karlin.mff.cuni.cz/~pavel/velo/index.html.
CRAK: a process checkpoint/restart kernel module
This project has been there for over one year, and I've got quite a few
emails asking about it. Before it becomes more reliable, I think letting
more people know about it is a good idea. Thanks to those who ever
pushed me on it :-)

I guess many of you already know about epckpt, a patch written
by Eduardo Pinheiro that adds process checkpoint/restart capability to the
Linux kernel. CRAK does a similar thing - in fact, I started this
project based on epckpt's code, but by now the two are very different. The
major differences are:

* CRAK is a kernel module (!!)
* CRAK doesn't do any bookkeeping (thus no run time overhead)
* CRAK uses a different strategy to checkpoint parallel processes
  (user space vs kernel space, and signal vs semaphore)

Moreover, I've successfully (in the sense of working for simple cases such
as telnet) added network socket support. Due to some academic reasons I
have not put this portion of code online, but I'll do so as soon as
possible.

The main website is at http://www.cs.columbia.edu/~huaz/research/crak.htm.
It works for 2.2.19 and 2.4.4 (the latter is still beta).

You can also learn more about checkpointing at
http://www.checkpointing.org (maintained by Eduardo Pinheiro).

Speaking of reliability, it's not 100% reliable. Originally I wanted to
make it more reliable before announcing it, but I have since realized (and
been convinced) that letting people know about it earlier could make this
goal happen sooner.

All comments/praise/criticism are welcome. Thanks.

Hua Zhong
Central Research Facilities
Department of Computer Science
Columbia University
New York, NY 10027
Email: [EMAIL PROTECTED]
http://www.cs.columbia.edu/~huaz