Re: OS support for fault tolerance
On Fri, Feb 24, 2012 at 3:10 PM, Dieter BSD wrote:
> Depends on what sort of work the machine is doing. If the job is something that can be done again, you could simply try again; if you still get different answers, try a third machine or wade in and start manually inspecting things until you find the problem.
>
> If the job is time critical or you can't get the same inputs again, then the machine needs to get it right the first time. How many 9s of reliability do you need and how many resources can you throw at it? 2x hardware can be good for better than 5 9s (high quality hardware and software, and technicians standing by with cold spares). I've heard that mil gear uses 3x hardware.
>
> Building a 5 9s system is... non-trivial. So I'm wondering what sort of reliability we can get with 2x off-the-shelf commodity hardware and a bit of software? Similar to mirroring/RAID but with whole computers rather than just disks. Classic Unix technique of doing 10-20% of the work and getting 80-90% of the result.

I don't have anything particularly insightful to add to this conversation, but it is something I've looked into a bit. The solution which seemed most promising to me is Remus. I don't know if anyone has heard of it, so I offer a link:

http://static.usenix.org/event/nsdi08/tech/full_papers/cully/cully_html/

I understand this doesn't correlate exactly with the OP's point, but there is good material there regardless.

--
Adam Vande More
___
freebsd-hackers@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-hackers
To unsubscribe, send any mail to "freebsd-hackers-unsubscr...@freebsd.org"
Re: OS support for fault tolerance
>> The problem then is how to feed both machines the same inputs, and compare the outputs. Do we need a third machine to supervise? Can we have each machine keep an eye on the other, avoiding the need for a third machine?
>
> A pair would work as long as the only failures are "obvious" (e.g. crashes). If they simply disagree as to the result, how would we determine which one was right?

Depends on what sort of work the machine is doing. If the job is something that can be done again, you could simply try again; if you still get different answers, try a third machine or wade in and start manually inspecting things until you find the problem.

If the job is time critical or you can't get the same inputs again, then the machine needs to get it right the first time. How many 9s of reliability do you need and how many resources can you throw at it? 2x hardware can be good for better than 5 9s (high quality hardware and software, and technicians standing by with cold spares). I've heard that mil gear uses 3x hardware.

Building a 5 9s system is... non-trivial. So I'm wondering what sort of reliability we can get with 2x off-the-shelf commodity hardware and a bit of software? Similar to mirroring/RAID but with whole computers rather than just disks. Classic Unix technique of doing 10-20% of the work and getting 80-90% of the result.

>> Which then leads to the issue of how to avoid problems when *it* breaks.
>
> For some reason, this reminds me of a Dr. Seuss story:
> http://www.goodreads.com/review/show/49519038

*grin* Gotta love Dr. Seuss.
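For the "how many 9s" question, the arithmetic can be sketched roughly as below. This is only a back-of-the-envelope model: the function names are made up for illustration, and `redundant_availability` assumes the machines fail independently and that failover is instant and perfect, which real 2x commodity setups will not fully achieve.

```python
MINUTES_PER_YEAR = 365.25 * 24 * 60  # 525,960 minutes


def availability_from_nines(nines: int) -> float:
    """Convert a count of nines to an availability, e.g. 5 -> 0.99999."""
    return 1.0 - 10.0 ** (-nines)


def downtime_minutes_per_year(availability: float) -> float:
    """Expected minutes of downtime per year at the given availability."""
    return (1.0 - availability) * MINUTES_PER_YEAR


def redundant_availability(single: float, copies: int) -> float:
    """Availability of `copies` machines when one survivor is enough.

    Assumption (not from the thread): failures are independent and
    failover is instant and perfect.
    """
    return 1.0 - (1.0 - single) ** copies


if __name__ == "__main__":
    for n in (3, 4, 5):
        a = availability_from_nines(n)
        print(f"{n} nines: {downtime_minutes_per_year(a):.2f} min/year")
    pair = redundant_availability(0.99, 2)
    print(f"two independent 99% machines: {pair:.4%}")
```

Under those assumptions, five nines leaves only a little over five minutes of downtime per year, and two independent 99% machines reach about 99.99%. Shared failure modes (same bug, same power feed) are what eat into that figure, which is why using two different machines was suggested.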
Re: OS support for fault tolerance
On 2/20/12 6:32 AM, Da Rock wrote:
> On 02/15/12 03:25, Brandon Falk wrote:
>> On 2/14/2012 12:05 PM, Jason Hellenthal wrote:
>>> On Tue, Feb 14, 2012 at 08:57:10AM -0800, Julian Elischer wrote:
>>>> On 2/14/12 6:23 AM, Maninya M wrote:
>>>>> For multicore desktop computers, if one of the cores fails, the FreeBSD OS crashes. My question is about how I can make the OS tolerate this hardware fault. The strategy is to checkpoint the state of each core at specific intervals of time in main memory. Once a core fails, its previous state is retrieved from the main memory, and the processes that were running on it are rescheduled on the remaining cores.
>>>>> I read that the OS tolerates faults in large servers. I need to make it do this for a desktop OS. I assume I would have to change the scheduler program. I am using FreeBSD 9.0 on an Intel Core i5 quad-core machine. How do I go about doing this? What exactly do I need to save for the "state" of the core? What else do I need to know? I have absolutely no experience with kernel programming or with FreeBSD. Any pointers to good sources about modifying the source code of FreeBSD would be greatly appreciated.
>>>> This question has always intrigued me, because I'm always amazed that people actually try. From my viewpoint, there's really not much you can do if the core that is currently holding the scheduler lock fails. And what do you mean by "fails"? Do you run constant diagnostics? How do you tell when it has failed? It'd be hard to detect that 'multiply' has suddenly started giving bad results now and then.
>>>> If it just "stops" then you might be able to have a watchdog that notices, but what do you do when it was halfway through rearranging a list of items? First, you have to find out that it held the lock for the module, and then you have to find out what it had done and clean up the mess.
>>>> This requires rewriting many, many parts of the kernel to remove "transient inconsistent states". And even then, what do you do if it was halfway through manipulating some hardware?
>>>> And when you've figured that all out, how do you cope with the mess it made because it was dying? Say, for example, it had started calculating bad memory offsets before writing out some stuff and written data out over random memory?
>>>> But I'm interested in any answers people may have.
>>> How about core redundancy? Effectively this would cut the number of available cores in half if you spread a process to run on two cores at the same time, but with an option to adjust this per process, etc. I don't see it as infeasible.
>> The overhead for all of the error checking and redundancy makes this idea pretty impractical. You'd have to have 2 cores to do the exact same thing, then some 'master' core that makes sure they're doing the right stuff, and if you really want to think about it... what if the core monitoring the cores fails... there's a threshold of when redundancy gets pointless.
> Make no mistake here, I'm not really up with the guts of what this would require (the dog may not hunt at all). Consider me as the little boy throwing rocks at a hornet's nest :)
> That out of the way, how about this scenario: why can't the master be dynamic amongst the cores? 1 core be the master of any 2 cores (not itself).
> Another thought (probably more sci-fi than anything else) is about using the cores as individuals which work as a team and fire a weak team member that is failing. I have absolutely no idea how to accomplish this, but I thought it might fire a few neurons in someone who does... :)

There are so many reasons this would be ineffective on standard hardware that I have no idea where to begin, but see my email above.

>> Perhaps I'm missing out on something, but you can't check the checker (without infinite redundancy). Honestly, if you're worried about a core failing, please take your server cluster out of the 1000 deg C forge.
>> -Brandon
Re: OS support for fault tolerance
"Dieter BSD" wrote: > The problem then is how to feed both machines the same inputs, and > compare the outputs. ??Do we need a third machine to supervise? > Can we have each machine keep an eye on the other, avoiding the > need for a third machine? A pair would work as long as the only failures are "obvious" (e.g. crashes). If they simply disagree as to the result, how would we determine which one was right? > Which then leads to the issue of how to avoid problems when *it* > breaks. For some reason, this reminds me of a Dr. Seuss story: http://www.goodreads.com/review/show/49519038 ___ freebsd-hackers@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-hackers To unsubscribe, send any mail to "freebsd-hackers-unsubscr...@freebsd.org"
Re: OS support for fault tolerance
Rayson writes:
> The question is, are we planning to handle >95% of the errors for >99% of the hardware we run on, or are we really planning to spend years trying to design something that would require special hardware support?

I assume this started as: "Oh look, most CPUs have multiple cores these days, maybe we could play with fault tolerance". That could be useful if CPU cores failed a lot, but in reality what fails is disks, disks, controllers, disks, random other things, and disks. That is assuming you have avoided the garbage-quality stuff and have the system on a UPS.

If you have enough ports you can add more disks and mirror them, or use some other version of RAID. The next step is to duplicate everything. Not by looking for a mainboard with redundant everything, but by simply adding another computer. And rather than getting two of the same machine, you're better off if they are different, so that they don't have the same bugs.

The problem then is how to feed both machines the same inputs, and compare the outputs. Do we need a third machine to supervise? Which then leads to the issue of how to avoid problems when *it* breaks. Can we have each machine keep an eye on the other, avoiding the need for a third machine?
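A small sketch of the "compare the outputs" step, to make the two-versus-three machine question concrete. It assumes each machine's output can be reduced to a comparable value (a hash of the result, say); the function name `vote` is invented for illustration, and how the outputs reach the voter, or who watches the voter, is not addressed.

```python
from collections import Counter
from typing import Hashable, Optional, Sequence


def vote(outputs: Sequence[Hashable]) -> Optional[Hashable]:
    """Return the strict-majority output, or None if there is none."""
    if not outputs:
        return None
    value, count = Counter(outputs).most_common(1)[0]
    # A strict majority is required; a tie means nobody can be trusted.
    return value if count > len(outputs) / 2 else None


if __name__ == "__main__":
    print(vote([42, 42]))      # pair agrees
    print(vote([42, 41]))      # pair disagrees: detected, but not resolved
    print(vote([42, 41, 42]))  # a third machine breaks the tie
```

With two machines a disagreement can be detected but not settled, so the job has to be rerun or inspected by hand. A third machine lets a majority mask a single bad answer, at the cost of one more box that can break.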
Re: OS support for fault tolerance
On 02/15/12 03:25, Brandon Falk wrote:
> On 2/14/2012 12:05 PM, Jason Hellenthal wrote:
>> On Tue, Feb 14, 2012 at 08:57:10AM -0800, Julian Elischer wrote:
>>> On 2/14/12 6:23 AM, Maninya M wrote:
>>>> For multicore desktop computers, if one of the cores fails, the FreeBSD OS crashes. My question is about how I can make the OS tolerate this hardware fault. The strategy is to checkpoint the state of each core at specific intervals of time in main memory. Once a core fails, its previous state is retrieved from the main memory, and the processes that were running on it are rescheduled on the remaining cores.
>>>> I read that the OS tolerates faults in large servers. I need to make it do this for a desktop OS. I assume I would have to change the scheduler program. I am using FreeBSD 9.0 on an Intel Core i5 quad-core machine. How do I go about doing this? What exactly do I need to save for the "state" of the core? What else do I need to know? I have absolutely no experience with kernel programming or with FreeBSD. Any pointers to good sources about modifying the source code of FreeBSD would be greatly appreciated.
>>> This question has always intrigued me, because I'm always amazed that people actually try. From my viewpoint, there's really not much you can do if the core that is currently holding the scheduler lock fails. And what do you mean by "fails"? Do you run constant diagnostics? How do you tell when it has failed? It'd be hard to detect that 'multiply' has suddenly started giving bad results now and then.
>>> If it just "stops" then you might be able to have a watchdog that notices, but what do you do when it was halfway through rearranging a list of items? First, you have to find out that it held the lock for the module, and then you have to find out what it had done and clean up the mess.
>>> This requires rewriting many, many parts of the kernel to remove "transient inconsistent states". And even then, what do you do if it was halfway through manipulating some hardware?
>>> And when you've figured that all out, how do you cope with the mess it made because it was dying? Say, for example, it had started calculating bad memory offsets before writing out some stuff and written data out over random memory?
>>> But I'm interested in any answers people may have.
>> How about core redundancy? Effectively this would cut the number of available cores in half if you spread a process to run on two cores at the same time, but with an option to adjust this per process, etc. I don't see it as infeasible.
> The overhead for all of the error checking and redundancy makes this idea pretty impractical. You'd have to have 2 cores to do the exact same thing, then some 'master' core that makes sure they're doing the right stuff, and if you really want to think about it... what if the core monitoring the cores fails... there's a threshold of when redundancy gets pointless.

Make no mistake here, I'm not really up with the guts of what this would require (the dog may not hunt at all). Consider me as the little boy throwing rocks at a hornet's nest :)

That out of the way, how about this scenario: why can't the master be dynamic amongst the cores? 1 core be the master of any 2 cores (not itself).

Another thought (probably more sci-fi than anything else) is about using the cores as individuals which work as a team and fire a weak team member that is failing. I have absolutely no idea how to accomplish this, but I thought it might fire a few neurons in someone who does... :)

> Perhaps I'm missing out on something, but you can't check the checker (without infinite redundancy). Honestly, if you're worried about a core failing, please take your server cluster out of the 1000 deg C forge.
> -Brandon
Re: OS support for fault tolerance
On 2/14/12 3:51 PM, Jan Mikkelsen wrote:
> Coming back to the multicore issue:
>
> The problem when a core fails is that it has affected more than its own state. It will be holding locks on shared resources and may have corrupted shared memory or asked a device to do the wrong thing. By the time you detect a fault in a core, it is too late.
>
> Checkpointing to main memory means that you need to be able to roll back to a checkpoint, and replay operations you know about. That involves more than CPU core state; it includes process, file and device state.

I think that's more or less what I was saying, but with more concrete examples. And yes, I remember the Tandem boxes from computer shows in Perth and Sydney, but never saw one in the field.
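The "roll back to a checkpoint, and replay operations you know about" idea can be illustrated with a toy, user-level sketch. Everything here is hypothetical (`CheckpointedWorker`, `deposit`, the dictionary standing in for state) and it is nowhere near kernel checkpointing: it covers only in-memory state owned by one object and only operations that were logged and are deterministic.

```python
import copy
from typing import Callable, Dict, List

State = Dict[str, int]
Operation = Callable[[State], None]


class CheckpointedWorker:
    """Toy model: keep a snapshot plus a log of operations applied since."""

    def __init__(self, state: State) -> None:
        self.state = state
        self._checkpoint = copy.deepcopy(state)
        self._log: List[Operation] = []

    def apply(self, op: Operation) -> None:
        self._log.append(op)  # log first, so the operation can be replayed
        op(self.state)

    def checkpoint(self) -> None:
        self._checkpoint = copy.deepcopy(self.state)
        self._log.clear()

    def recover(self) -> None:
        """Roll back to the last checkpoint, then replay the logged work."""
        self.state = copy.deepcopy(self._checkpoint)
        for op in self._log:
            op(self.state)


def deposit(amount: int) -> Operation:
    def op(state: State) -> None:
        state["balance"] += amount
    return op


if __name__ == "__main__":
    worker = CheckpointedWorker({"balance": 100})
    worker.apply(deposit(50))
    worker.checkpoint()
    worker.apply(deposit(25))
    worker.state["balance"] = -999  # simulated corruption by a dying core
    worker.recover()
    print(worker.state)
```

Recovery works here only because the corruption was confined to state the worker owns and every legitimate change went through the log. Writes to devices, files or other processes' memory are outside the log, which is the point being made above about file and device state.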
Re: OS support for fault tolerance
Mirrored SMP? Even NonStops require a supervisory CPU subsystem to manage what is working or not. SMP itself would have to be totally rethought. My suggestion is to study the examples of NonStop and Guardian-90.

Julian Elischer wrote:
> On 2/14/12 6:23 AM, Maninya M wrote:
>> For multicore desktop computers, if one of the cores fails, the FreeBSD OS crashes. My question is about how I can make the OS tolerate this hardware fault. The strategy is to checkpoint the state of each core at specific intervals of time in main memory. Once a core fails, its previous state is retrieved from the main memory, and the processes that were running on it are rescheduled on the remaining cores.
>> I read that the OS tolerates faults in large servers. I need to make it do this for a desktop OS. I assume I would have to change the scheduler program. I am using FreeBSD 9.0 on an Intel Core i5 quad-core machine. How do I go about doing this? What exactly do I need to save for the "state" of the core? What else do I need to know? I have absolutely no experience with kernel programming or with FreeBSD. Any pointers to good sources about modifying the source code of FreeBSD would be greatly appreciated.
> This question has always intrigued me, because I'm always amazed that people actually try. From my viewpoint, there's really not much you can do if the core that is currently holding the scheduler lock fails. And what do you mean by "fails"? Do you run constant diagnostics? How do you tell when it has failed? It'd be hard to detect that 'multiply' has suddenly started giving bad results now and then.
> If it just "stops" then you might be able to have a watchdog that notices, but what do you do when it was halfway through rearranging a list of items? First, you have to find out that it held the lock for the module, and then you have to find out what it had done and clean up the mess.
> This requires rewriting many, many parts of the kernel to remove "transient inconsistent states". And even then, what do you do if it was halfway through manipulating some hardware?
> And when you've figured that all out, how do you cope with the mess it made because it was dying? Say, for example, it had started calculating bad memory offsets before writing out some stuff and written data out over random memory?
> But I'm interested in any answers people may have.
Re: OS support for fault tolerance
Brandon Falk wrote:
> On 2/14/2012 12:05 PM, Jason Hellenthal wrote:
>> On Tue, Feb 14, 2012 at 08:57:10AM -0800, Julian Elischer wrote:
>>> On 2/14/12 6:23 AM, Maninya M wrote:
>>>> For multicore desktop computers, if one of the cores fails, the FreeBSD OS crashes. My question is about how I can make the OS tolerate this hardware fault. The strategy is to checkpoint the state of each core at specific intervals of time in main memory. Once a core fails, its previous state is retrieved from the main memory, and the processes that were running on it are rescheduled on the remaining cores.
>>>> I read that the OS tolerates faults in large servers. I need to make it do this for a desktop OS. I assume I would have to change the scheduler program. I am using FreeBSD 9.0 on an Intel Core i5 quad-core machine. How do I go about doing this? What exactly do I need to save for the "state" of the core? What else do I need to know? I have absolutely no experience with kernel programming or with FreeBSD. Any pointers to good sources about modifying the source code of FreeBSD would be greatly appreciated.
>>> This question has always intrigued me, because I'm always amazed that people actually try. From my viewpoint, there's really not much you can do if the core that is currently holding the scheduler lock fails. And what do you mean by "fails"? Do you run constant diagnostics? How do you tell when it has failed? It'd be hard to detect that 'multiply' has suddenly started giving bad results now and then.
>>> If it just "stops" then you might be able to have a watchdog that notices, but what do you do when it was halfway through rearranging a list of items? First, you have to find out that it held the lock for the module, and then you have to find out what it had done and clean up the mess.
>>> This requires rewriting many, many parts of the kernel to remove "transient inconsistent states". And even then, what do you do if it was halfway through manipulating some hardware?
>>> And when you've figured that all out, how do you cope with the mess it made because it was dying? Say, for example, it had started calculating bad memory offsets before writing out some stuff and written data out over random memory?
>>> But I'm interested in any answers people may have.
>> How about core redundancy? Effectively this would cut the number of available cores in half if you spread a process to run on two cores at the same time, but with an option to adjust this per process, etc. I don't see it as infeasible.
> The overhead for all of the error checking and redundancy makes this idea pretty impractical. You'd have to have 2 cores to do the exact same thing, then some 'master' core that makes sure they're doing the right stuff, and if you really want to think about it... what if the core monitoring the cores fails... there's a threshold of when redundancy gets pointless.
> Perhaps I'm missing out on something, but you can't check the checker (without infinite redundancy). Honestly, if you're worried about a core failing, please take your server cluster out of the 1000 deg C forge.
> -Brandon

Don't forget that cache would have to be redundant too. The redundant cores must not share an on-die cache.

Oh, and the real biggie: what about the chipset and busses??? Those would NOT be redundant.
Re: OS support for fault tolerance
On Tue, Feb 14, 2012 at 6:01 PM, Julian Elischer wrote:
> True, but you can't guarantee that a cpu is going to fail in a way that you can detect like that. What if the clock just stops?

The question is, are we planning to handle >95% of the errors for >99% of the hardware we run on, or are we really planning to spend years trying to design something that would require special hardware support?

On the zSeries mainframe, instructions are executed in lock step on redundant instruction pipelines, and if the results don't match, the instruction is re-executed. This happens on every load and store.

Now, if you want software to do the same thing, you will need to somehow checkpoint the state of not only the processor but the memory as well, or else if the bad processor stores something to memory you will still get corrupted data. Not only would the kernel become very complicated, the system would also become very slow. And what if the checkpointing code is executed by faulty processors?

IIRC, processors & disks don't usually just fail without warning. That's the whole idea behind SMART, and Fault Management in Solaris & other kernels.

http://hub.opensolaris.org/bin/view/Community+Group+fm/

Rayson

Open Grid Scheduler / Grid Engine
http://gridscheduler.sourceforge.net/

Scalable Grid Engine Support Program
http://www.scalablelogic.com/

> I believe that even those systems that support cpu deactivation on error only catch some percentage of the problems, and that sometimes it was more of "bring up the system without cpu X after it all crashed in flames".
>
> Tandem and other systems in the old days used to be able to cope with dying cpus pretty well, but they had support from top to bottom and the software was written with 'clustering' in mind.
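As a rough software analogy for the execute-twice, compare, and re-execute scheme described for the mainframe pipelines, consider the sketch below. It is not how zSeries hardware works internally; `run_in_lockstep`, `flaky` and `PersistentMismatch` are invented names, and the sketch assumes the computation is a pure function with no side effects, which is exactly the condition that stores to memory break.

```python
from typing import Callable, TypeVar

T = TypeVar("T")


class PersistentMismatch(Exception):
    """Raised when the two executions never agree."""


def run_in_lockstep(primary: Callable[[], T], shadow: Callable[[], T],
                    max_retries: int = 3) -> T:
    """Run a computation twice and accept the result only on agreement."""
    for _ in range(max_retries + 1):
        a, b = primary(), shadow()
        if a == b:
            return a
        # Mismatch: discard both results and re-execute.
    raise PersistentMismatch(f"no agreement after {max_retries} retries")


def flaky(good: int, bad: int, failures: int) -> Callable[[], int]:
    """Demo helper: return `bad` for the first `failures` calls, then `good`."""
    remaining = [failures]

    def call() -> int:
        if remaining[0] > 0:
            remaining[0] -= 1
            return bad
        return good

    return call


if __name__ == "__main__":
    # One transient fault in the shadow copy is masked by a retry.
    print(run_in_lockstep(lambda: 6 * 7, flaky(42, 0, failures=1)))
```

A transient fault costs one retry; a persistent one ends in an exception rather than a wrong answer. Two copies failing in the same way would still agree and go unnoticed, and the comparison code itself runs on hardware that may be faulty, which is the "check the checker" objection raised elsewhere in the thread.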
RE: OS support for fault tolerance
> -----Original Message-----
> From: owner-freebsd-hack...@freebsd.org [mailto:owner-freebsd-hack...@freebsd.org] On Behalf Of Julian Elischer
> Sent: Tuesday, February 14, 2012 3:02 PM
> To: Rayson Ho
> Cc: Maninya M; freebsd-hackers@freebsd.org
> Subject: Re: OS support for fault tolerance
>
> On 2/14/12 9:27 AM, Rayson Ho wrote:
>> On Tue, Feb 14, 2012 at 11:57 AM, Julian Elischer wrote:
>>> but I'm interested in any answers people may have
>> The way other OSes handle this is by detecting any abnormal amounts of faults (sometimes it's not the fault of the hardware, e.g. when a particle from outer space hits a core and flips a bit), then they disable the core(s).
>>
>> Solaris & mainframe (z/OS) handle it this way, but you should google and find more info since I don't remember all the details.
>>
>> Also, see this presentation: "Getting to know the Solaris Fault Management Architecture (FMA)":
>> http://www.prefetch.net/presentations/SolarisFaultManagement_Presentation.pdf
> True, but you can't guarantee that a cpu is going to fail in a way that you can detect like that. What if the clock just stops? I believe that even those systems that support cpu deactivation on error only catch some percentage of the problems, and that sometimes it was more of "bring up the system without cpu X after it all crashed in flames".
>
> Tandem and other systems in the old days used to be able to cope with dying cpus pretty well, but they had support from top to bottom and the software was written with 'clustering' in mind.

Nowadays NEC has its sixth-generation "Fault Tolerant (FT) Series" servers, which are pretty much like the Tandem servers. We got a live demo of [simulated] CPU failure and the system kept chugging along. But as Julian says, it's not guaranteed that the CPU will always fail in a predictable way (however, NEC has produced a VERY nice redundant package with a 256-bit backplane to keep everything nice and lock-step).

--
Devin
Re: OS support for fault tolerance
On 15/02/2012, at 3:57 AM, Julian Elischer wrote:
> On 2/14/12 6:23 AM, Maninya M wrote:
>> For multicore desktop computers, if one of the cores fails, the FreeBSD OS crashes. My question is about how I can make the OS tolerate this hardware fault. The strategy is to checkpoint the state of each core at specific intervals of time in main memory. Once a core fails, its previous state is retrieved from the main memory, and the processes that were running on it are rescheduled on the remaining cores.
>> I read that the OS tolerates faults in large servers. I need to make it do this for a desktop OS. I assume I would have to change the scheduler program. I am using FreeBSD 9.0 on an Intel Core i5 quad-core machine. How do I go about doing this? What exactly do I need to save for the "state" of the core? What else do I need to know? I have absolutely no experience with kernel programming or with FreeBSD. Any pointers to good sources about modifying the source code of FreeBSD would be greatly appreciated.
> This question has always intrigued me, because I'm always amazed that people actually try. From my viewpoint, there's really not much you can do if the core that is currently holding the scheduler lock fails. And what do you mean by "fails"? Do you run constant diagnostics? How do you tell when it has failed? It'd be hard to detect that 'multiply' has suddenly started giving bad results now and then.
> If it just "stops" then you might be able to have a watchdog that notices, but what do you do when it was halfway through rearranging a list of items? First, you have to find out that it held the lock for the module, and then you have to find out what it had done and clean up the mess.
> This requires rewriting many, many parts of the kernel to remove "transient inconsistent states". And even then, what do you do if it was halfway through manipulating some hardware?
> And when you've figured that all out, how do you cope with the mess it made because it was dying? Say, for example, it had started calculating bad memory offsets before writing out some stuff and written data out over random memory?
> But I'm interested in any answers people may have.

Back in the '90s I spent a bunch of time looking at and using systems that dealt with this kind of failure.

There are two basic approaches: with software support and without. The basic distinction is what the hardware can do when something breaks. Is it able to continue, or must it stop immediately?

Tandem had systems with both approaches:

The NonStop proprietary operating system had nodes with lock-step processors and lots of error checking that would stop immediately when something broke. A CPU failure turned into a node halt. There was a bunch of work to have nodes move their state around so that terminal sessions would not be interrupted, transactions would be rolled back, and everything would be in a consistent state.

The Integrity Unix range was based on MIPS RISC/os, with a lot of work at Tandem. We had the R2000 and later the R3000 based systems. They had three CPUs all in lock step with voting ("triple modular redundancy"), and entirely duplicated memory, all with ECC. There were redundant busses, separate cabinets for controllers and separate cabinets for each side of the disk mirror. You could pull out a CPU board and memory board, show a manager, and then plug them back in.

Tandem claimed to have removed 80% of panics from the kernel, and changed the device driver architecture so that they could recover from some driver faults by reinitialising driver state on a running system.

We still had some outages on this system, all caused by software. It was also expensive: AUD$1,000,000 for a system with the same underlying CPU/memory as a $30k MIPS workstation at the time. It was also slower because of the error checking overhead. However, it did crash much less than the MIPS boxes.

Coming back to the multicore issue:

The problem when a core fails is that it has affected more than its own state. It will be holding locks on shared resources and may have corrupted shared memory or asked a device to do the wrong thing. By the time you detect a fault in a core, it is too late.

Checkpointing to main memory means that you need to be able to roll back to a checkpoint, and replay operations you know about. That involves more than CPU core state; it includes process, file and device state.

The Tandem lesson is that it is much easier when you involve the higher level software in dealing with these issues. Building a system where you can make the application programmer ignorant of the need to deal with failure is much harder than when you expose units of work to the application programmer and can just fail a node and replay the work somewhere else. Transactions are your friend.

Lots of literature on this stuff. My favourite is "Transaction Processing:
Re: OS support for fault tolerance
On 2/14/12 9:27 AM, Rayson Ho wrote:
> On Tue, Feb 14, 2012 at 11:57 AM, Julian Elischer wrote:
>> but I'm interested in any answers people may have
> The way other OSes handle this is by detecting any abnormal amounts of faults (sometimes it's not the fault of the hardware, e.g. when a particle from outer space hits a core and flips a bit), then they disable the core(s).
>
> Solaris & mainframe (z/OS) handle it this way, but you should google and find more info since I don't remember all the details.
>
> Also, see this presentation: "Getting to know the Solaris Fault Management Architecture (FMA)":
> http://www.prefetch.net/presentations/SolarisFaultManagement_Presentation.pdf

True, but you can't guarantee that a cpu is going to fail in a way that you can detect like that. What if the clock just stops? I believe that even those systems that support cpu deactivation on error only catch some percentage of the problems, and that sometimes it was more of "bring up the system without cpu X after it all crashed in flames".

Tandem and other systems in the old days used to be able to cope with dying cpus pretty well, but they had support from top to bottom and the software was written with 'clustering' in mind.
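The "detect an abnormal amount of faults, then disable the core" policy might look something like this in outline. To be clear, this is a guess at the general shape and not Solaris FMA's actual diagnosis logic; the class name, the threshold and the window are all made-up parameters, and it only helps with faults that the hardware reports (corrected ECC errors and the like), not with a core that silently computes garbage or simply stops.

```python
from collections import defaultdict, deque
from typing import Deque, Dict, Set


class CoreFaultMonitor:
    """Take a core offline once it reports too many faults in a time window."""

    def __init__(self, threshold: int = 3, window_seconds: float = 60.0) -> None:
        self.threshold = threshold
        self.window = window_seconds
        self._events: Dict[int, Deque[float]] = defaultdict(deque)
        self.offline: Set[int] = set()

    def report_fault(self, core: int, now: float) -> bool:
        """Record a reported fault; return True if the core is now offline."""
        events = self._events[core]
        events.append(now)
        # Drop faults that have aged out of the window: an isolated bit
        # flip from a stray particle should not condemn the core.
        while events and now - events[0] > self.window:
            events.popleft()
        if len(events) >= self.threshold:
            self.offline.add(core)
        return core in self.offline


if __name__ == "__main__":
    monitor = CoreFaultMonitor(threshold=3, window_seconds=60.0)
    for t in (0.0, 100.0, 110.0, 120.0):
        print(t, monitor.report_fault(core=1, now=t))
```

In the demo the fault at t=0 ages out, so core 1 is only retired after three faults land within the same 60 seconds. A real system would then have to migrate work off the core, which runs straight back into the lock and shared-state problems discussed above.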
Re: OS support for fault tolerance
On 2012-02-14 18:13, Joshua Isom wrote:
> On 2/14/2012 10:57 AM, Julian Elischer wrote:
>> On 2/14/12 6:23 AM, Maninya M wrote:
>>> For multicore desktop computers, if one of the cores fails, the FreeBSD OS crashes. My question is about how I can make the OS tolerate this hardware fault. The strategy is to checkpoint the state of each core at specific intervals of time in main memory. Once a core fails, its previous state is retrieved from main memory, and the processes that were running on it are rescheduled on the remaining cores.
>>>
>>> I read that the OS tolerates faults in large servers. I need to make it do this for a desktop OS. I assume I would have to change the scheduler program. I am using FreeBSD 9.0 on an Intel Core i5 quad-core machine. How do I go about doing this? What exactly do I need to save for the "state" of the core? What else do I need to know? I have absolutely no experience with kernel programming or with FreeBSD. Any pointers to good sources about modifying the source code of FreeBSD would be greatly appreciated.
>>
>> This question has always intrigued me, because I'm always amazed that people actually try. From my viewpoint, there's really not much you can do if the core that is currently holding the scheduler lock fails. And what do you mean by "fails"? Do you run constant diagnostics? How do you tell when it has failed? It'd be hard to detect that 'multiply' has suddenly started giving bad results now and then.
>>
>> If it just "stops" then you might be able to have a watchdog that notices, but what do you do when it was halfway through rearranging a list of items? First, you have to find out that it held the lock for the module, and then you have to find out what it had done and clean up the mess.
>>
>> This requires rewriting many, many parts of the kernel to remove "transient inconsistent states". And even then, what do you do if it was halfway through manipulating some hardware?
>>
>> And when you've figured that all out, how do you cope with the mess it made because it was dying? Say, for example, it had started calculating bad memory offsets before writing out some stuff and had written data over random memory?
>>
>> But I'm interested in any answers people may have.
>
> The only way I could see it being done, without direct hardware support, would be to virtualize it, similar to how valgrind works. You'll take a speed hit bad enough to want to turn it off, but it could be possible. Testing that it works well could just mean overclocking your CPU until it starts crashing, and then seeing if it doesn't crash.

Sun/Fujitsu SPARC64 CPUs have had "mainframe class" memory mirroring, end-to-end ECC protection, register ECC and hardware instruction retry for many years now, for the exact reasons that we discuss here: fault tolerance, (high) availability, etc. These features are typically called RAS (reliability, availability and serviceability).

You can read more here:
http://www.fujitsu.com/global/services/computing/server/sparcenterprise/technology/availability/processor.html

/Uffe
Re: OS support for fault tolerance
On Tue, Feb 14, 2012 at 12:05 PM, Jason Hellenthal wrote:
> How about core redundancy? Effectively this would cut the number of available cores in half if you spread a process to run on two cores at the same time, but with an option to adjust this per process etc... I don't see it as infeasible.

There are a number of papers discussing core redundancy. They pretty much all work the same way: process the work on two different cores (or verify some subset of the work on the second core), and wait for both cores to return prior to the commit phase.

One example: www.eecs.umich.edu/~taustin/papers/MICRO32-diva.pdf
Another example: www.ee.duke.edu/~sorin/papers/ieeemicro08_argus.pdf

These don't use existing cores on a multi-core chip but instead use a "functional correctness" chip, though I've seen designs that use the former as well.

--
Eitan Adler
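The scheme described above (run the work on two cores, wait for both, compare, then commit) can be sketched in a few lines. This is only an illustration in Python, not how DIVA or Argus work in hardware: the names run_redundant, healthy_core, faulty_core and commit are hypothetical, threads stand in for cores, and the fault is simulated by flipping one bit of the result.

```python
from concurrent.futures import ThreadPoolExecutor


class ReplicaMismatch(Exception):
    """Raised when the redundant executions disagree."""


def run_redundant(work, arg, replicas):
    """Run work(arg) once per replica; return the result only if all agree.

    `replicas` is a list of callables standing in for cores (an assumption
    of this sketch). Each takes (work, arg) and returns a result.
    """
    with ThreadPoolExecutor(max_workers=len(replicas)) as pool:
        futures = [pool.submit(replica, work, arg) for replica in replicas]
        results = [f.result() for f in futures]  # wait for every replica
    if any(r != results[0] for r in results[1:]):
        raise ReplicaMismatch(results)
    return results[0]


def healthy_core(work, arg):
    return work(arg)


def faulty_core(work, arg):
    # Simulated hardware fault: flip the lowest bit of an integer result.
    return work(arg) ^ 1


def commit(store, key, work, arg, replicas):
    """Write to `store` only after the redundant results have been compared."""
    store[key] = run_redundant(work, arg, replicas)


def square(x):
    return x * x


if __name__ == "__main__":
    store = {}
    commit(store, "ok", square, 7, [healthy_core, healthy_core])
    try:
        commit(store, "bad", square, 7, [healthy_core, faulty_core])
    except ReplicaMismatch as exc:
        print("not committed, replicas disagreed:", exc)
    print(store)
```

With only two replicas a mismatch can be detected but not attributed to either side, so the sketch refuses to commit instead of picking a winner.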
Re: OS support for fault tolerance
On Tue, Feb 14, 2012 at 11:57 AM, Julian Elischer wrote:
> but I'm interested in any answers people may have

The way other OSes handle this is by detecting any abnormal number of faults (sometimes it's not the fault of the hardware, e.g. when a particle from outer space hits a core and flips a bit), then they disable the core(s). Solaris & mainframe (z/OS) handle it this way, but you should google and find more info since I don't remember all the details.

Also, see this presentation: "Getting to know the Solaris Fault Management Architecture (FMA)":
http://www.prefetch.net/presentations/SolarisFaultManagement_Presentation.pdf

Rayson

=
Open Grid Scheduler / Grid Engine
http://gridscheduler.sourceforge.net/

Scalable Grid Engine Support Program
http://www.scalablelogic.com/

--
Rayson

==
Open Grid Scheduler - The Official Open Source Grid Engine
http://gridscheduler.sourceforge.net/
Re: OS support for fault tolerance
On 2/14/2012 12:05 PM, Jason Hellenthal wrote:
>
> On Tue, Feb 14, 2012 at 08:57:10AM -0800, Julian Elischer wrote:
>> On 2/14/12 6:23 AM, Maninya M wrote:
>>> For multicore desktop computers, if one of the cores fails, the FreeBSD OS crashes. My question is about how I can make the OS tolerate this hardware fault. The strategy is to checkpoint the state of each core at specific intervals of time in main memory. Once a core fails, its previous state is retrieved from main memory, and the processes that were running on it are rescheduled on the remaining cores.
>>>
>>> I read that the OS tolerates faults in large servers. I need to make it do this for a desktop OS. I assume I would have to change the scheduler program. I am using FreeBSD 9.0 on an Intel Core i5 quad-core machine. How do I go about doing this? What exactly do I need to save for the "state" of the core? What else do I need to know? I have absolutely no experience with kernel programming or with FreeBSD. Any pointers to good sources about modifying the source code of FreeBSD would be greatly appreciated.
>> This question has always intrigued me, because I'm always amazed that people actually try. From my viewpoint, there's really not much you can do if the core that is currently holding the scheduler lock fails. And what do you mean by "fails"? Do you run constant diagnostics? How do you tell when it has failed? It'd be hard to detect that 'multiply' has suddenly started giving bad results now and then.
>>
>> If it just "stops" then you might be able to have a watchdog that notices, but what do you do when it was halfway through rearranging a list of items? First, you have to find out that it held the lock for the module, and then you have to find out what it had done and clean up the mess.
>>
>> This requires rewriting many, many parts of the kernel to remove "transient inconsistent states". And even then, what do you do if it was halfway through manipulating some hardware?
>>
>> And when you've figured that all out, how do you cope with the mess it made because it was dying? Say, for example, it had started calculating bad memory offsets before writing out some stuff and had written data over random memory?
>>
>> But I'm interested in any answers people may have.
>>
> How about core redundancy? Effectively this would cut the number of available cores in half if you spread a process to run on two cores at the same time, but with an option to adjust this per process etc... I don't see it as infeasible.
>

The overhead for all of the error checking and redundancy makes this idea pretty impractical. You'd have to have 2 cores to do the exact same thing, then some 'master' core that makes sure they're doing the right stuff, and if you really want to think about it... what if the core monitoring the cores fails... there's a threshold at which redundancy gets pointless. Perhaps I'm missing out on something, but you can't check the checker (without infinite redundancy).

Honestly, if you're worried about a core failing, please take your server cluster out of the 1000 deg C forge.

-Brandon
Re: OS support for fault tolerance
On 2/14/2012 10:57 AM, Julian Elischer wrote:
> On 2/14/12 6:23 AM, Maninya M wrote:
>> For multicore desktop computers, if one of the cores fails, the FreeBSD OS crashes. My question is about how I can make the OS tolerate this hardware fault. The strategy is to checkpoint the state of each core at specific intervals of time in main memory. Once a core fails, its previous state is retrieved from main memory, and the processes that were running on it are rescheduled on the remaining cores.
>>
>> I read that the OS tolerates faults in large servers. I need to make it do this for a desktop OS. I assume I would have to change the scheduler program. I am using FreeBSD 9.0 on an Intel Core i5 quad-core machine. How do I go about doing this? What exactly do I need to save for the "state" of the core? What else do I need to know? I have absolutely no experience with kernel programming or with FreeBSD. Any pointers to good sources about modifying the source code of FreeBSD would be greatly appreciated.
>
> This question has always intrigued me, because I'm always amazed that people actually try. From my viewpoint, there's really not much you can do if the core that is currently holding the scheduler lock fails. And what do you mean by "fails"? Do you run constant diagnostics? How do you tell when it has failed? It'd be hard to detect that 'multiply' has suddenly started giving bad results now and then.
>
> If it just "stops" then you might be able to have a watchdog that notices, but what do you do when it was halfway through rearranging a list of items? First, you have to find out that it held the lock for the module, and then you have to find out what it had done and clean up the mess.
>
> This requires rewriting many, many parts of the kernel to remove "transient inconsistent states". And even then, what do you do if it was halfway through manipulating some hardware?
>
> And when you've figured that all out, how do you cope with the mess it made because it was dying? Say, for example, it had started calculating bad memory offsets before writing out some stuff and had written data over random memory?
>
> But I'm interested in any answers people may have.

The only way I could see it being done, without direct hardware support, would be to virtualize it, similar to how valgrind works. You'll take a speed hit bad enough to want to turn it off, but it could be possible. Testing that it works well could just mean overclocking your CPU until it starts crashing, and then seeing if it doesn't crash.
Re: OS support for fault tolerance
On Tue, Feb 14, 2012 at 8:57 AM, Julian Elischer wrote:
> On 2/14/12 6:23 AM, Maninya M wrote:
>> For multicore desktop computers, if one of the cores fails, the FreeBSD OS crashes. My question is about how I can make the OS tolerate this hardware fault. The strategy is to checkpoint the state of each core at specific intervals of time in main memory. Once a core fails, its previous state is retrieved from main memory, and the processes that were running on it are rescheduled on the remaining cores.
>>
>> I read that the OS tolerates faults in large servers. I need to make it do this for a desktop OS. I assume I would have to change the scheduler program. I am using FreeBSD 9.0 on an Intel Core i5 quad-core machine. How do I go about doing this? What exactly do I need to save for the "state" of the core? What else do I need to know? I have absolutely no experience with kernel programming or with FreeBSD. Any pointers to good sources about modifying the source code of FreeBSD would be greatly appreciated.
>
> This question has always intrigued me, because I'm always amazed that people actually try. From my viewpoint, there's really not much you can do if the core that is currently holding the scheduler lock fails.

We did this at IBM after we'd done the dynamic logical partitioning. Basically, there was a way to probe the CPU for the number of correctable errors it was encountering. At too high a threshold, it was considered "faulty" and we offlined the CPU before it encountered an uncorrectable error. We did the same thing for memory, too (that one I was directly involved in). The basic trouble, though, is that at least for memory, there didn't seem to be a correlation between the rate of correctable ECC errors and an uncorrectable error occurring.

> And what do you mean by "fails"? Do you run constant diagnostics? How do you tell when it has failed? It'd be hard to detect that 'multiply' has suddenly started giving bad results now and then.

I'd assume this is predicated on the ability of the hardware to have some redundancy and some way to query the error rate. I've done a little work with memory ECC on the device driver end, and at least there the hardware definitely reports correctable and uncorrectable ECC errors via some registers. But I don't know if there's any way to query this for a CPU (and of course each CPU would be different).

However, all that said, it's a moderately large project to get an OS ready to handle things like holes appearing in its logical CPU ID space (how do you serialize this when you want the common case to not take a lock?), and to do all the wizardry of unscheduling (what do you do with a bound thread?) and then actually shutting the CPU down via firmware so it doesn't continue running. I started working on this for Linux when I worked at IBM, somewhere around 2004, and then IBM got sued by SCO so they pulled me off the project. It was finished up by a colleague and friend.

You can probably come to a first approximation by forcing e.g. the idle thread to not get switched out when the CPU appears unstable. Then at least it's running fewer instructions, and is less likely to generate a machine check.

Cheers,
matthew
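A rough sketch of the threshold policy described above: count correctable errors per CPU over a sampling window and offline any CPU that exceeds a limit. Everything here is hypothetical (the CpuHealthMonitor class, the threshold value, the way counts arrive). Real hardware exposes such counters in machine-specific ways, and actually offlining a CPU involves the unscheduling and firmware work mentioned above, none of which is modelled.

```python
from dataclasses import dataclass, field


@dataclass
class CpuHealthMonitor:
    """Toy policy: offline CPUs whose correctable-error count is too high."""

    threshold: int                                # hypothetical per-window limit
    online: set = field(default_factory=set)      # logical CPU IDs still in use
    counts: dict = field(default_factory=dict)    # errors seen in this window

    def record_correctable_errors(self, cpu, n):
        # Ignore reports for CPUs that have already been taken offline.
        if cpu in self.online:
            self.counts[cpu] = self.counts.get(cpu, 0) + n

    def end_of_window(self):
        """Offline every CPU over the threshold, but never the last one.

        Returns the list of CPU IDs offlined in this window and resets
        the counters for the next window.
        """
        offlined = []
        for cpu in sorted(self.online):
            over = self.counts.get(cpu, 0) > self.threshold
            if over and len(self.online) > 1:
                self.online.discard(cpu)
                offlined.append(cpu)
        self.counts.clear()
        return offlined


if __name__ == "__main__":
    monitor = CpuHealthMonitor(threshold=10, online={0, 1, 2, 3})
    monitor.record_correctable_errors(2, 11)
    print("offlined:", monitor.end_of_window())
    print("still online:", sorted(monitor.online))
```

As noted above for memory, a high correctable-error rate did not seem to predict uncorrectable errors, so a policy like this is a heuristic at best.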
Re: OS support for fault tolerance
On Tue, Feb 14, 2012 at 08:57:10AM -0800, Julian Elischer wrote:
> On 2/14/12 6:23 AM, Maninya M wrote:
> > For multicore desktop computers, if one of the cores fails, the FreeBSD OS crashes. My question is about how I can make the OS tolerate this hardware fault. The strategy is to checkpoint the state of each core at specific intervals of time in main memory. Once a core fails, its previous state is retrieved from main memory, and the processes that were running on it are rescheduled on the remaining cores.
> >
> > I read that the OS tolerates faults in large servers. I need to make it do this for a desktop OS. I assume I would have to change the scheduler program. I am using FreeBSD 9.0 on an Intel Core i5 quad-core machine. How do I go about doing this? What exactly do I need to save for the "state" of the core? What else do I need to know? I have absolutely no experience with kernel programming or with FreeBSD. Any pointers to good sources about modifying the source code of FreeBSD would be greatly appreciated.
> This question has always intrigued me, because I'm always amazed that people actually try. From my viewpoint, there's really not much you can do if the core that is currently holding the scheduler lock fails. And what do you mean by "fails"? Do you run constant diagnostics? How do you tell when it has failed? It'd be hard to detect that 'multiply' has suddenly started giving bad results now and then.
>
> If it just "stops" then you might be able to have a watchdog that notices, but what do you do when it was halfway through rearranging a list of items? First, you have to find out that it held the lock for the module, and then you have to find out what it had done and clean up the mess.
>
> This requires rewriting many, many parts of the kernel to remove "transient inconsistent states". And even then, what do you do if it was halfway through manipulating some hardware?
>
> And when you've figured that all out, how do you cope with the mess it made because it was dying? Say, for example, it had started calculating bad memory offsets before writing out some stuff and had written data over random memory?
>
> But I'm interested in any answers people may have.
>

How about core redundancy? Effectively this would cut the number of available cores in half if you spread a process to run on two cores at the same time, but with an option to adjust this per process etc... I don't see it as infeasible.
Re: OS support for fault tolerance
On 2/14/12 6:23 AM, Maninya M wrote:
> For multicore desktop computers, if one of the cores fails, the FreeBSD OS crashes. My question is about how I can make the OS tolerate this hardware fault. The strategy is to checkpoint the state of each core at specific intervals of time in main memory. Once a core fails, its previous state is retrieved from main memory, and the processes that were running on it are rescheduled on the remaining cores.
>
> I read that the OS tolerates faults in large servers. I need to make it do this for a desktop OS. I assume I would have to change the scheduler program. I am using FreeBSD 9.0 on an Intel Core i5 quad-core machine. How do I go about doing this? What exactly do I need to save for the "state" of the core? What else do I need to know? I have absolutely no experience with kernel programming or with FreeBSD. Any pointers to good sources about modifying the source code of FreeBSD would be greatly appreciated.

This question has always intrigued me, because I'm always amazed that people actually try. From my viewpoint, there's really not much you can do if the core that is currently holding the scheduler lock fails. And what do you mean by "fails"? Do you run constant diagnostics? How do you tell when it has failed? It'd be hard to detect that 'multiply' has suddenly started giving bad results now and then.

If it just "stops" then you might be able to have a watchdog that notices, but what do you do when it was halfway through rearranging a list of items? First, you have to find out that it held the lock for the module, and then you have to find out what it had done and clean up the mess.

This requires rewriting many, many parts of the kernel to remove "transient inconsistent states". And even then, what do you do if it was halfway through manipulating some hardware?

And when you've figured that all out, how do you cope with the mess it made because it was dying? Say, for example, it had started calculating bad memory offsets before writing out some stuff and had written data over random memory?

But I'm interested in any answers people may have.
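The watchdog idea mentioned above can be sketched as a heartbeat table: each core records a timestamp periodically, and a checker flags any core that has gone quiet for longer than a timeout. The names (HeartbeatWatchdog, beat, stalled) are hypothetical, and timestamps are passed in explicitly so the example is deterministic. It only covers the "just stops" case: it can't notice a core that keeps running but computes wrong results, and it says nothing about which locks the stalled core was holding.

```python
class HeartbeatWatchdog:
    """Flags cores whose last heartbeat is older than `timeout` seconds."""

    def __init__(self, cpus, timeout):
        self.timeout = timeout
        # Treat time 0.0 as the moment every core was last known to be alive.
        self.last_seen = {cpu: 0.0 for cpu in cpus}

    def beat(self, cpu, now):
        """Called from each core's periodic tick (simulated here)."""
        self.last_seen[cpu] = now

    def stalled(self, now):
        """Return the sorted IDs of cores that have missed their deadline."""
        return sorted(
            cpu for cpu, seen in self.last_seen.items()
            if now - seen > self.timeout
        )


if __name__ == "__main__":
    dog = HeartbeatWatchdog(cpus=[0, 1, 2], timeout=1.0)
    for t in (0.5, 1.0, 1.5):
        dog.beat(0, t)
        dog.beat(1, t)
    dog.beat(2, 0.5)          # core 2 ticks once, then goes silent
    print("stalled cores:", dog.stalled(now=2.0))
```

Detection is the easy part; the cleanup questions raised above (half-updated lists, held locks, devices left mid-operation) are untouched by this.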