Bug#358696: This is dangerous, please make it default to disabled
Hi fellows,

2006/7/29, Daniel Richard G. [EMAIL PROTECTED]:

On Fri, 2006 Jul 28 16:12:35 -0300, Henrique de Moraes Holschuh wrote: ...

Anyway, the disagreement comes down to this:

Me: Keep the system minimally running, so that it powers off when the UPS cuts the power, so that it will turn on again when the power returns, given the default behavior and limitations of PC hardware. Do sensible steps to avoid data loss (stop the disks, etc.). Have this be the default, as PC users are the common case.

You: Do a normal system shutdown. Rely on server-grade features (e.g. WOL packet from a networked UPS) to resume operation, or an On/Off state: ON BIOS setting (despite the problems associated with that). Have this be the default, as the risk of data loss from fragile storage media trumps that of system unavailability after an extended outage.

Mr. Quette will have to decide this, but I don't think you've made a strong case for a power-cut being significantly detrimental to data or hardware. Yes, there are circumstances where this can happen, but these are exceptions to the rule. And in one well-known case (RAID arrays), the scripts can easily do something different.

I think you'll take issue with the NUT documentation, then, as it specifically suggests this approach.

I will. But maybe, perchance, the NUT docs don't suggest you do it unless you own hardware that cannot do it properly? I didn't read it yet.

I'm getting the impression that hardware that cannot do it properly, as you mean it, includes most PCs and non-server machines. Your view carries the day if NUT's userbase is not mostly these.

The point you're talking about is a long-standing problem for which I haven't yet found a *perfect* solution. As you have both stated well, hardware differences, the huge number of UPS setups, and BIOS default configurations make it hard (or impossible) to find The Solution.
Just to avoid any misunderstanding: by default, NUT relies on the hardware being halted and on the BIOS being configured to power on when AC is restored. I'll thus leave the patch, but disable it in -2 (scheduled for release by tomorrow), referencing the present thread as a WARNING.

When I get more time (too busy for the moment with NUT bridging to HAL, some major code rewrite and internal projects), I'll restart two subprojects (NPS - NUT Packaging Standard, and QA - Quality Assurance: https://alioth.debian.org/pm/?group_id=30602) and try to find The Solution. While the former will focus on NUT integration (i.e. the halt procedure), the latter will focus on the reliability of the UPS poweroff and such things (like finding upstream workarounds for dumb UPSs to address power races).

Thank you both for your constructive feedback, and don't hesitate to add more comments.

Arnaud
--
Linux / Unix Expert - MGE UPS SYSTEMS - RD Dpt
Network UPS Tools (NUT) Project Leader - http://www.networkupstools.org/
Debian Developer - http://people.debian.org/~aquette/
OpenSource Developer - http://arnaud.quette.free.fr/

--
To UNSUBSCRIBE, email to [EMAIL PROTECTED] with a subject of unsubscribe. Trouble? Contact [EMAIL PROTECTED]
Bug#358696: This is dangerous, please make it default to disabled
On Fri, 28 Jul 2006, Daniel Richard G. wrote:
On Fri, 2006 Jul 28 16:12:35 -0300, Henrique de Moraes Holschuh wrote:

There is no tradeoff without the hack, and the hack is only needed in hardware unsuitable for UPS management. Thus, it must be optional. It is dangerous to data and the hardware, so it should not be the default.

Define (un)suitable for UPS management. Does this definition include most people's desktop systems?

Suitable for UPS management:

Load:
- Powers up when AC returns
- Can be informed that it must shutdown by the UPS (through NUT).

UPS:
- Does delayed load shutdown upon shutdown command
- Does not power up the load before it has enough charge to do a delayed shutdown, plus safety margin.
- Always power-cycles the load after a shutdown command is ACK'ed to the controlling host. Even if AC returns, and it doesn't need to shutdown anymore.
- Communicates to the host when battery charge is below a certain threshold, so that it can shutdown safely.
- Powers up the load if the batteries have enough charge, and an AC cycle happens while the load is offline.
- Powers up the load after a timer expires, if no AC cycles happen AND the load was brought offline by an explicit delayed shutdown command.

Anything else is unsuitable. Any PC97 desktop should be suitable for proper UPS management. And just FYI, PC97 requires WoL on all ethernet devices, not that you need WoL for a proper UPS setup, but you somehow got the idea that WoL was a server-grade feature...

You have transient responses to power cuts. Watch it on an oscilloscope; computer hardware is not a resistive load.

No, but any decent power supply will present a load pretty close to it,

Only ones with PFC.

production server-room environments.) If someone's got a rack setup where a UPS power cutoff will fry everything, they've got a much bigger problem than what we're discussing here.

Yes.

number of machines connected, but large numbers of machines connected are not exactly a typical scenario.
No, but your hard-drive doing emergency unloads is a typical scenario, and desktop HDs don't like those unloads *at* *all*. Do not do it (and as I already said, the only proper way to know the HD heads are unloaded requires kernel cooperation, and it is NOT done by userspace currently). I know you were under the mistaken impression that we could guarantee all HD heads were unloaded in userspace, and before halt runs. We not only cannot do it, we also do not *attempt* to do it. The only thing in Debian initscripts that really tries to take care of HD head unloads is the halt command. You can, of course, try to make sure hdparm was run and actually unloaded all heads for your particular configuration, but it is not an acceptable default, because we cannot get it right every time. So implement it as an admin-enabled, admin-configured option by all means. But *not* as a default.

All of which can be done (and already is, I believe). The only thing that the system is doing while waiting for poweroff is sleep 15m; reboot---no disks need to be spinning for that.

If you did not call halt, plus told the kernel to shutdown the devices, no, it was *not* done. And the kernel is the *only* thing that really knows how to properly powerdown the devices. Currently, we cannot ask it to do so from userspace easily, and if we did, we could not access the disks anymore for example.

We have hdparm -Y. We can't access the disk after that, but we shouldn't need to. What more shutdown magic do you need on a hard disk that is not spinning?

None, if the disk spun down. But hdparm doesn't work for all disks. And we cannot reliably spin down all disks and unload heads from userspace, for all possible configurations. Thus, anything that relies on this cannot be made a default.

If you're talking about a flaky hardware RAID array where you can't stop

SCSI plus all software RAID arrays.
The issue is how the initscript behaves if the NUT shutdown command doesn't kill everything to kingdom come in 5 seconds. In fact, a proper UPS is going to be programmed to actually *delay* the powerdown load command for enough time to allow the load to try to powerdown for real by itself.

Assuming things are as I had in my patch, the idea is to have all machines connected to a given UPS configured with a similar wait-until-poweroff-else-reboot time (if they don't shutdown straightaway).

The bad thing in your patch is that the maintainer made it non-optional, and the default. I understand it will not be a default anymore, which is enough for me.

Anyway, the disagreement comes down
Bug#358696: This is dangerous, please make it default to disabled
On Mon, 2006 Jul 31 13:47:21 -0300, Henrique de Moraes Holschuh wrote:

Define (un)suitable for UPS management. Does this definition include most people's desktop systems?

Suitable for UPS management:

Load:
- Powers up when AC returns
- Can be informed that it must shutdown by the UPS (through NUT).

Okay, so pretty much anything that can run NUT. Nice.

UPS:
- Does delayed load shutdown upon shutdown command
- Does not power up the load before it has enough charge to do a delayed shutdown, plus safety margin.

Pretty basic stuff, yes.

- Always power-cycles the load after a shutdown command is ACK'ed to the controlling host. Even if AC returns, and it doesn't need to shutdown anymore.

Many low-end UPSes fail here. Power races would be an academic issue if not for this.

- Communicates to the host when battery charge is below a certain threshold, so that it can shutdown safely.
- Powers up the load if the batteries have enough charge, and an AC cycle happens while the load is offline.
- Powers up the load after a timer expires, if no AC cycles happen AND the load was brought offline by an explicit delayed shutdown command.

Anything else is unsuitable.

Hi, I'm Bob, and I have an unsuitable UPS. Can I use it with Debian?

Any PC97 desktop should be suitable for proper UPS management. And just FYI, PC97 requires WoL on all ethernet devices, not that you need WoL for a proper UPS setup, but you somehow got the idea that WoL was a server-grade feature...

Fair enough, but a UPS with an Ethernet port (and a means of configuring WoL) certainly is. If not in purpose, then in price.

No, but any decent power supply will present a load pretty close to it,

Only ones with PFC.

Decent power supplies have PFC.

number of machines connected, but large numbers of machines connected are not exactly a typical scenario.

No, but your hard-drive doing emergency unloads is a typical scenario, and desktop HDs don't like those unloads *at* *all*.
Do not do it (and as I already said, the only proper way to know the HD heads are unloaded requires kernel cooperation, and it is NOT done by userspace currently). I know you were under the mistaken impression that we could guarantee all HD heads were unloaded in userspace, and before halt runs. We not only cannot do it, we also do not *attempt* to do it. The only thing in Debian initscripts that really tries to take care of HD head unloads is the halt command. You can, of course, try to make sure hdparm was run and actually unloaded all heads for your particular configuration, but it is not an acceptable default, because we cannot get it right every time. So implement it as an admin-enabled, admin-configured option by all means. But *not* as a default.

Perhaps the sleep-then-reboot loop belongs inside the halt command, then. At some point, there's going to be little difference between cutting power to the PSU, and having the PSU do a soft poweroff.

need to. What more shutdown magic do you need on a hard disk that is not spinning?

None, if the disk spun down. But hdparm doesn't work for all disks. And we cannot reliably spin down all disks and unload heads from userspace, for all possible configurations. Thus, anything that relies on this cannot be made a default.

If you're talking about a flaky hardware RAID array where you can't stop

SCSI plus all software RAID arrays.

I mean _after_ mdadm is stopped. Not that any distinction is currently made between RAID setups, flaky or otherwise.

The bad thing in your patch is that the maintainer made it non-optional, and the default. I understand it will not be a default anymore, which is enough for me.

I agree that having it non-optional is undesirable.

You: Do a normal system shutdown. Rely on server-grade features (e.g. WOL

No. Me: make the whole behaviour you want *optional*, and not the default, because it is dangerous and we don't have a lick of a chance of making it safe for all setups.
packet from a networked UPS) to resume operation, or an On/Off state: ON

No. Rely on standard PC97 ACPI desktop BIOS option "always power on on AC return", which is the correct way to deal with machines that need to restart when a UPS powers them up again.

Correct? The PC will then always turn on when the AC returns, e.g. when being plugged in, or after a power outage when it was off to begin with. The PSU's hard power switch isn't a solution, either, as it is often inconvenient/inaccessible and many newer consumer PSUs don't even have one. The real solution is to have an on/off state bit that can be frobbed by the OS, but I'm not holding my breath
Bug#358696: This is dangerous, please make it default to disabled
On Fri, 2006 Jul 28 14:46:47 -0300, Henrique de Moraes Holschuh wrote:

1. The UPS may take more than 15 minutes to shutdown the load. You cannot assume things like this, and you will cause data loss if you get it wrong: the power-off could come with the system fully online.

The time period should be configurable; I just suggested 15 minutes as a default. You could set a higher value, but the tradeoff is that if the power returns, the system is unavailable for that time period.

2. Not powering off the box by itself (read: allowing halt and the kernel to do its job and cut power cleanly) means it will be subject to high transients when the UPS shuts down the load. This will, in turn, make it worse for the other loads that have not been properly shut down. It would be a disaster in a server farm.

Please elaborate on how server equipment is subjected to a transient when a UPS cuts power to it. (If anything, the situation is much worse when it is powered back on.)

3. Non-controlled shutdowns are *very* bad for all hardware, including desktop systems. For starters, all disks will be subject to emergency head unloads. The halt utility does a lot of work-arounds for kernel bugs to make sure disks are parked, RAID arrays are in read-only mode or shutdown, etc for a damn good reason.

All of which can be done (and already is, I believe). The only thing that the system is doing while waiting for poweroff is sleep 15m; reboot---no disks need to be spinning for that.

4. It is very probable that in any non-home scenario, a UPS will protect more than one piece of equipment. In those scenarios, the UPS is configured to NOT accept an immediate "shutdown the load" command from any of the equipment, just from the main controller host. NUT is geared to work fine and specifically support such configurations. This has to be taken into account.

Isn't this already the case for non-networked UPSes?
When the interface is serial or USB, it can only be connected to (and controlled by) a single, master host.

Thus, implementing the work around proposed in this bug report as a default behaviour is not acceptable. Please revert the change, or make it optional, and *not* enabled by default. I would go even further and actively discourage heavily the use of this option, as it can damage the hardware.

I think you'll take issue with the NUT documentation, then, as it specifically suggests this approach.

--Daniel
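A minimal shell sketch of the wait-then-reboot behaviour argued over above, made opt-in and with a configurable delay. The variable names POWEROFF_WAIT_ENABLED and POWEROFF_WAIT_SECONDS and both function names are hypothetical; they are not taken from the NUT package or from the patch in this bug report. It does not attempt to park or spin down disks, which is the part the thread disagrees on.

```shell
#!/bin/sh
# Hypothetical settings; neither name comes from the NUT package.
# Disabled unless the admin explicitly opts in.
POWEROFF_WAIT_ENABLED="${POWEROFF_WAIT_ENABLED:-no}"
# 900 seconds = 15 minutes, the default suggested in the thread.
POWEROFF_WAIT_SECONDS="${POWEROFF_WAIT_SECONDS:-900}"

# Print the action the shutdown script should take:
# "halt" for a normal shutdown, or "wait N" to idle N seconds and then reboot.
choose_action() {
    case "$POWEROFF_WAIT_ENABLED" in
        yes) echo "wait $POWEROFF_WAIT_SECONDS" ;;
        *)   echo "halt" ;;
    esac
}

# Carry out the chosen action.
run_action() {
    # Split the "wait N" / "halt" answer into positional parameters.
    set -- $(choose_action)
    if [ "$1" = "wait" ]; then
        # Stay up until the UPS cuts power; if it never does, reboot.
        sleep "$2"
        reboot
    else
        halt
    fi
}
```

A shutdown script would call run_action at the point where it would otherwise halt. With the defaults above, nothing changes for admins who never touch the settings.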
Bug#358696: This is dangerous, please make it default to disabled
On Fri, 28 Jul 2006, Daniel Richard G. wrote:

The time period should be configurable; I just suggested 15 minutes as a default. You could set a higher value, but the tradeoff is that if the power returns, the system is unavailable for that time period.

There is no tradeoff without the hack, and the hack is only needed in hardware unsuitable for UPS management. Thus, it must be optional. It is dangerous to data and the hardware, so it should not be the default. It is fairly simple, really, unless I missed something major (which is always possible).

2. Not powering off the box by itself (read: allowing halt and the kernel to do its job and cut power cleanly) means it will be subject to high transients when the UPS shuts down the load. This will, in turn, make it worse for the other loads that have not been properly shut down. It would be a disaster in a server farm.

Please elaborate on how server equipment is subjected to a transient when a UPS cuts power to it. (If anything, the situation is much worse when it is powered back on.)

You have transient responses to power cuts. Watch it on an oscilloscope; computer hardware is not a resistive load. The situation is bad when everything powers up at the same time too, yes. That's why it isn't all powered up at once in server rooms, blade enclosures, etc.

3. Non-controlled shutdowns are *very* bad for all hardware, including desktop systems. For starters, all disks will be subject to emergency head unloads. The halt utility does a lot of work-arounds for kernel bugs to make sure disks are parked, RAID arrays are in read-only mode or shutdown, etc for a damn good reason.

All of which can be done (and already is, I believe). The only thing that the system is doing while waiting for poweroff is sleep 15m; reboot---no disks need to be spinning for that.

If you did not call halt, plus told the kernel to shutdown the devices, no, it was *not* done.
And the kernel is the *only* thing that really knows how to properly powerdown the devices. Currently, we cannot ask it to do so from userspace easily, and if we did, we could not access the disks anymore for example.

Isn't this already the case for non-networked UPSes? When the interface is serial or USB, it can only be connected to (and controlled by) a single, master host.

The issue is how the initscript behaves if the NUT shutdown command doesn't kill everything to kingdom come in 5 seconds. In fact, a proper UPS is going to be programmed to actually *delay* the powerdown load command for enough time to allow the load to try to powerdown for real by itself.

Thus, implementing the work around proposed in this bug report as a default behaviour is not acceptable. Please revert the change, or make it optional, and *not* enabled by default. I would go even further and actively discourage heavily the use of this option, as it can damage the hardware.

I think you'll take issue with the NUT documentation, then, as it specifically suggests this approach.

I will. But maybe, perchance, the NUT docs don't suggest you do it unless you own hardware that cannot do it properly? I didn't read it yet.

--
One disk to rule them all, One disk to find them.
One disk to bring them all and in the darkness grind them.
In the Land of Redmond where the shadows lie.
-- The Silicon Valley Tarot
Henrique Holschuh
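The ordering described above, where the host asks the UPS for a delayed load cut and then lets halt finish normally, might look like this in shell. This is a simplified sketch, not the Debian initscript: KILLPOWER_FLAG and ups_final_halt are hypothetical names, and the default path /etc/killpower is an assumption that has to match the POWERDOWNFLAG setting in upsmon.conf. upsdrvctl shutdown is the NUT command that asks the driver to power off the UPS load.

```shell
#!/bin/sh
# Flag file left behind when a power-failure shutdown is in progress.
# Assumed to match POWERDOWNFLAG in upsmon.conf; the variable name is hypothetical.
KILLPOWER_FLAG="${KILLPOWER_FLAG:-/etc/killpower}"

# Meant to run late in the halt sequence.
ups_final_halt() {
    if [ -f "$KILLPOWER_FLAG" ]; then
        # Ask the UPS to cut the load. A UPS that delays the cut gives the
        # host time to finish powering down on its own.
        upsdrvctl shutdown
    fi
    # Normal halt path: the kernel shuts down devices and cuts power.
    halt
}
```

On a UPS that cuts the load immediately, the host would lose power before halt completes, which is the case the quoted "5 seconds" concern is about.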
Bug#358696: This is dangerous, please make it default to disabled
On Fri, 2006 Jul 28 16:12:35 -0300, Henrique de Moraes Holschuh wrote:

There is no tradeoff without the hack, and the hack is only needed in hardware unsuitable for UPS management. Thus, it must be optional. It is dangerous to data and the hardware, so it should not be the default.

Define (un)suitable for UPS management. Does this definition include most people's desktop systems?

You have transient responses to power cuts. Watch it on an oscilloscope; computer hardware is not a resistive load.

No, but any decent power supply will present a load pretty close to it, making such a transient negligible. (I know this to be the case in production server-room environments.) If someone's got a rack setup where a UPS power cutoff will fry everything, they've got a much bigger problem than what we're discussing here.

The situation is bad when everything powers up at the same time too, yes. That's why it isn't all powered up at once in server rooms, blade enclosures, etc.

Yes. No problem with wanting staggered shutdown, when you have a large number of machines connected, but large numbers of machines connected are not exactly a typical scenario.

All of which can be done (and already is, I believe). The only thing that the system is doing while waiting for poweroff is sleep 15m; reboot---no disks need to be spinning for that.

If you did not call halt, plus told the kernel to shutdown the devices, no, it was *not* done. And the kernel is the *only* thing that really knows how to properly powerdown the devices. Currently, we cannot ask it to do so from userspace easily, and if we did, we could not access the disks anymore for example.

We have hdparm -Y. We can't access the disk after that, but we shouldn't need to. What more shutdown magic do you need on a hard disk that is not spinning? If you're talking about a flaky hardware RAID array where you can't stop the platters without it self-destructing, then fine.
I recall that the scripts check for RAID, and behave differently in that case.

The issue is how the initscript behaves if the NUT shutdown command doesn't kill everything to kingdom come in 5 seconds. In fact, a proper UPS is going to be programmed to actually *delay* the powerdown load command for enough time to allow the load to try to powerdown for real by itself.

Assuming things are as I had in my patch, the idea is to have all machines connected to a given UPS configured with a similar wait-until-poweroff-else-reboot time (if they don't shutdown straightaway). This approach is admittedly not the best one---ideally you'd have some sort of statically-linked death watch daemon that would do the same thing, but also monitor the UPS, and broadcast an online signal if the power returns. You'd no longer have to configure any wait-until-poweroff time, and the aforementioned tradeoff goes away. But this is a wishlist item.

Anyway, the disagreement comes down to this:

Me: Keep the system minimally running, so that it powers off when the UPS cuts the power, so that it will turn on again when the power returns, given the default behavior and limitations of PC hardware. Do sensible steps to avoid data loss (stop the disks, etc.). Have this be the default, as PC users are the common case.

You: Do a normal system shutdown. Rely on server-grade features (e.g. WOL packet from a networked UPS) to resume operation, or an On/Off state: ON BIOS setting (despite the problems associated with that). Have this be the default, as the risk of data loss from fragile storage media trumps that of system unavailability after an extended outage.

Mr. Quette will have to decide this, but I don't think you've made a strong case for a power-cut being significantly detrimental to data or hardware. Yes, there are circumstances where this can happen, but these are exceptions to the rule. And in one well-known case (RAID arrays), the scripts can easily do something different.
I think you'll take issue with the NUT documentation, then, as it specifically suggests this approach.

I will. But maybe, perchance, the NUT docs don't suggest you do it unless you own hardware that cannot do it properly? I didn't read it yet.

I'm getting the impression that hardware that cannot do it properly, as you mean it, includes most PCs and non-server machines. Your view carries the day if NUT's userbase is not mostly these.

--Daniel
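A rough shell sketch of the "death watch" wishlist item mentioned in this message: instead of sleeping for a fixed time, poll the UPS and reboot as soon as it reports being back on line power. Only the local polling half is shown; broadcasting an online signal to other hosts is left out, and a shell script is of course not the statically-linked daemon the message has in mind. upsc and the ups.status variable are NUT's client and status variable; the function names and the UPS name myups@localhost are hypothetical.

```shell
#!/bin/sh
# Succeed if a NUT status string (e.g. "OL CHRG", "OB LB") contains the
# OL token, meaning the UPS is on line power.
status_is_online() {
    case " $1 " in
        *" OL "*) return 0 ;;
        *)        return 1 ;;
    esac
}

# Poll the given UPS until it is back on line power, then reboot.
# $1: UPS identifier as understood by upsc; $2: poll interval in seconds.
watch_ups() {
    ups="$1"
    interval="${2:-10}"
    while :; do
        # An unreachable upsd yields an empty status, which counts as "not online".
        status="$(upsc "$ups" ups.status 2>/dev/null)"
        if status_is_online "$status"; then
            reboot
            return 0
        fi
        sleep "$interval"
    done
}
```

For example, `watch_ups myups@localhost 10` would check every ten seconds. If the UPS cuts power first, the loop simply ends with the machine.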