Bug#358696: This is dangerous, please make it default to disabled

2006-07-31 Thread Arnaud Quette

Hi fellows,

2006/7/29, Daniel Richard G. [EMAIL PROTECTED]:

On Fri, 2006 Jul 28 16:12:35 -0300, Henrique de Moraes Holschuh wrote:
...

Anyway, the disagreement comes down to this:

Me: Keep the system minimally running, so that it powers off when the UPS
cuts the power, so that it will turn on again when the power returns, given
the default behavior and limitations of PC hardware. Do sensible steps to
avoid data loss (stop the disks, etc.). Have this be the default, as PC
users are the common case.

You: Do a normal system shutdown. Rely on server-grade features (e.g. WOL
packet from a networked UPS) to resume operation, or an On/Off state: ON
BIOS setting (despite the problems associated with that). Have this be the
default, as the risk of data loss from fragile storage media trumps that of
system unavailability after an extended outage.

Mr. Quette will have to decide this, but I don't think you've made a strong
case for a power-cut being significantly detrimental to data or hardware.
Yes, there are circumstances where this can happen, but these are
exceptions to the rule. And in one well-known case (RAID arrays), the
scripts can easily do something different.

  I think you'll take issue with the NUT documentation, then, as it
  specifically suggests this approach.

 I will.  But maybe, perchance, the NUT docs don't suggest you do it unless
 you own hardware that cannot do it properly?  I didn't read it yet.

I'm getting the impression that hardware that cannot do it properly, as
you mean it, includes most PCs and non-server machines. Your view carries
the day if NUT's userbase is not mostly these.


The point you're talking about is a long-standing problem for which I
haven't yet found a *perfect* solution.

As you have both stated well, hardware differences, the huge number of
UPS setups, and default BIOS configurations make it hard (or
impossible) to find The Solution.

Just to avoid any misunderstanding: by default, NUT relies on the
hardware being halted and the BIOS being configured to power on when
AC is restored.

I'll thus leave the patch in, but disable it in -2 (scheduled for
release by tomorrow), referencing the present thread as a WARNING.
When I get more time (I'm too busy for the moment with NUT bridging to
HAL, some major code rewrites and internal projects), I'll restart two
subprojects (NPS - NUT Packaging Standard, and QA - Quality Assurance:
https://alioth.debian.org/pm/?group_id=30602) and try to find The
Solution. While the former will focus on NUT integration (i.e. the halt
procedure), the latter will focus on the reliability of the UPS
poweroff and related matters (like finding an upstream workaround for
dumb UPSs, to address power races).

Thank you both for your constructive feedback, and don't hesitate to
add more comments.

Arnaud
--
Linux / Unix Expert - MGE UPS SYSTEMS - RD Dpt
Network UPS Tools (NUT) Project Leader - http://www.networkupstools.org/
Debian Developer - http://people.debian.org/~aquette/
OpenSource Developer - http://arnaud.quette.free.fr/


--
To UNSUBSCRIBE, email to [EMAIL PROTECTED]
with a subject of unsubscribe. Trouble? Contact [EMAIL PROTECTED]



Bug#358696: This is dangerous, please make it default to disabled

2006-07-31 Thread Henrique de Moraes Holschuh
On Fri, 28 Jul 2006, Daniel Richard G. wrote:
 On Fri, 2006 Jul 28 16:12:35 -0300, Henrique de Moraes Holschuh wrote:
  There is no tradeoff without the hack, and the hack is only needed in
  hardware unsuitable for UPS management.  Thus, it must be optional.  It is
  dangerous to data and the hardware, so it should not be the default.
 
 Define (un)suitable for UPS management. Does this definition include
 most people's desktop systems?

Suitable for UPS management:
    Load:
        Powers up when AC returns.
        Can be informed by the UPS (through NUT) that it must shut
          down.
    UPS:
        Does a delayed load shutdown upon a shutdown command.
        Does not power up the load before it has enough charge to do
          a delayed shutdown, plus a safety margin.
        Always power-cycles the load after a shutdown command is
          ACK'ed to the controlling host, even if AC returns and it
          doesn't need to shut down anymore.
        Notifies the host when battery charge is below a certain
          threshold, so that it can shut down safely.
        Powers up the load if the batteries have enough charge and an
          AC cycle happens while the load is offline.
        Powers up the load after a timer expires, if no AC cycles
          happen AND the load was brought offline by an explicit
          delayed shutdown command.

Anything else is unsuitable.  Any PC97 desktop should be suitable for proper
UPS management.  And just FYI, PC97 requires WoL on all ethernet devices,
not that you need WoL for a proper UPS setup, but you somehow got the idea
that WoL was a server-grade feature...
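
The checklist above can be read as a simple all-or-nothing predicate. The
following is only a sketch of that reading: the class and field names are
hypothetical labels for the listed items and do not correspond to anything
in NUT.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class UpsSetup:
    """One flag per checklist item; every name here is hypothetical."""
    load_powers_up_on_ac_return: bool
    load_can_be_told_to_shut_down: bool
    ups_delays_load_shutdown: bool
    ups_waits_for_charge_before_power_up: bool
    ups_always_power_cycles_after_ack: bool
    ups_reports_low_battery: bool
    ups_powers_up_on_ac_cycle: bool
    ups_powers_up_after_timer: bool


def unmet_requirements(setup: UpsSetup) -> list[str]:
    """Return the names of the checklist items this setup fails."""
    return [f.name for f in fields(setup) if not getattr(setup, f.name)]


def is_suitable(setup: UpsSetup) -> bool:
    """Suitable only if every item holds; anything else is unsuitable."""
    return not unmet_requirements(setup)
```

For example, a UPS that cancels a pending shutdown when AC returns would
fail only the `ups_always_power_cycles_after_ack` item, yet would still
count as unsuitable under this definition.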

  You have transient responses to power cuts.  Watch it on an oscilloscope:
  computer hardware is not a resistive load.
 
 No, but any decent power supply will present a load pretty close to it, 

Only ones with PFC. 

 production server-room environments.) If someone's got a rack setup where a 
 UPS power cutoff will fry everything, they've got a much bigger problem 
 than what we're discussing here.

Yes.

 number of machines connected, but large numbers of machines connected are 
 not exactly a typical scenario.

No, but your hard-drive doing emergency unloads is a typical scenario, and
desktop HDs don't like those unloads *at* *all*.  Do not do it (and as I
already said, the only proper way to know the HD heads are unloaded requires
kernel cooperation, and it is NOT done by userspace currently). 

I know you were under the mistaken impression that we could guarantee all
HD heads were unloaded in userspace, and before halt runs.  We not only
cannot do it, we also do not *attempt* to do it.  The only thing in Debian
initscripts that really tries to take care of HD head unloads is the halt
command.

You can, of course, try to make sure hdparm was run and actually unloaded all
heads for your particular configuration, but it is not an acceptable
default, because we cannot get it right every time.  So implement it as an
admin-enabled, admin-configured option by all means.  But *not* as a
default.
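
A minimal sketch of what "admin-enabled, admin-configured" could look like,
assuming a hypothetical configuration object (`SpinDownConfig` and its
fields are invented for illustration). The only real command referenced is
`hdparm -Y`, which is mentioned elsewhere in this thread; the sketch builds
the command lines but does not execute them.

```python
from dataclasses import dataclass, field


@dataclass
class SpinDownConfig:
    """Hypothetical admin-facing settings for the spin-down step."""
    enabled: bool = False  # off unless the admin opts in
    devices: list[str] = field(default_factory=list)  # disks the admin vouches for


def spin_down_commands(config: SpinDownConfig) -> list[list[str]]:
    """Build one `hdparm -Y` command per listed disk, or none if disabled."""
    if not config.enabled:
        return []
    return [["hdparm", "-Y", device] for device in config.devices]
```

With the defaults, `spin_down_commands(SpinDownConfig())` yields no
commands at all, so nothing is attempted on disks the admin has not listed.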

   All of which can be done (and already is, I believe). The only thing that 
   the system is doing while waiting for poweroff is sleep 15m; 
   reboot---no 
   disks need to be spinning for that.
  
  If you did not call halt, plus told the kernel to shutdown the devices, no,
  it was *not* done.
  
  And the kernel is the *only* thing that really knows how to properly
  powerdown the devices.  Currently, we cannot ask it to do so from userspace
  easily, and if we did, we could not access the disks anymore for example.
 
 We have hdparm -Y. We can't access the disk after that, but we shouldn't 
 need to. What more shutdown magic do you need on a hard disk that is not 
 spinning?

None, if the disk spun down.  But hdparm doesn't work for all disks.  And we
cannot reliably spin down all disks and unload heads from userspace, for all
possible configurations.  Thus, anything that relies on this cannot be made
a default.

 If you're talking about a flaky hardware RAID array where you can't stop 

SCSI plus all software RAID arrays.

  The issue is how the initscript behaves if the NUT shutdown command doesn't
  kill everything to kingdom come in 5 seconds.  In fact, a proper UPS is
  going to be programmed to actually *delay* the load power-down command for
  enough time to allow the load to try to power down for real by itself.
 
 Assuming things are as I had in my patch, the idea is to have all machines 
 connected to a given UPS configured with a similar wait-until-poweroff- 
 else-reboot time (if they don't shutdown straightaway).

The bad thing in your patch is that the maintainer made it non-optional, and
the default.  I understand it will not be a default anymore, which is enough
for me.

 Anyway, the disagreement comes down 

Bug#358696: This is dangerous, please make it default to disabled

2006-07-31 Thread Daniel Richard G.
On Mon, 2006 Jul 31 13:47:21 -0300, Henrique de Moraes Holschuh wrote:
  
  Define (un)suitable for UPS management. Does this definition include
  most people's desktop systems?
 
 Suitable for UPS management:
   Load:
   Powers up when AC returns
   Can be informed that it must shutdown by the UPS
   (through NUT).

Okay, so pretty much anything that can run NUT. Nice.

   UPS:
   Does delayed load shutdown upon shutdown command
   Does not power up the load before it has enough charge
   to do a delayed shutdown, plus safety margin.

Pretty basic stuff, yes.

   Always power-cycles the load after a shutdown command is
   ACK'ed to the controlling host.  Even if AC
   returns, and it doesn't need to shutdown anymore.

Many low-end UPSes fail here. Power races would be an academic issue if not 
for this.
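
The power race mentioned here can be sketched as a two-input truth table.
This is an illustrative model only, assuming the host has already halted
and that its BIOS powers it up whenever the UPS output is switched back on;
the function and parameter names are hypothetical.

```python
def host_recovers(ac_returns_before_cutoff: bool,
                  ups_always_power_cycles: bool) -> bool:
    """Return True if a halted host comes back without manual intervention."""
    if not ac_returns_before_cutoff:
        # Ordinary outage: the UPS cuts the load, AC returns later, and
        # restoring the output powers the host up again.
        return True
    # AC came back after the shutdown command was acknowledged. Only a UPS
    # that still power-cycles its output will restart the halted host.
    return ups_always_power_cycles
```

The one failing combination (AC returns early, UPS does not power-cycle)
is the case where the host stays halted with power available.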

   Notifies the host when battery charge is below a
   certain threshold, so that it can shut down safely.
   Powers up the load if the batteries have enough charge,
   and an AC cycle happens while the load is offline.
   Powers up the load after a timer expires, if no AC cycles
   happen AND the load was brought offline by an explicit
   delayed shutdown command.
 
 Anything else is unsuitable.

Hi, I'm Bob, and I have an unsuitable UPS. Can I use it with Debian?

 Any PC97 desktop should be suitable for proper UPS management.  And just 
 FYI, PC97 requires WoL on all ethernet devices, not that you need WoL for 
 a proper UPS setup, but you somehow got the idea that WoL was a 
 server-grade feature...

Fair enough, but a UPS with an Ethernet port (and a means of configuring 
WoL) certainly is. If not in purpose, then in price.

  No, but any decent power supply will present a load pretty close to it, 
 
 Only ones with PFC. 

Decent power supplies have PFC.

  number of machines connected, but large numbers of machines connected are 
  not exactly a typical scenario.
 
 No, but your hard-drive doing emergency unloads is a typical scenario, and
 desktop HDs don't like those unloads *at* *all*.  Do not do it (and as I
 already said, the only proper way to know the HD heads are unloaded requires
 kernel cooperation, and it is NOT done by userspace currently). 
 
 I know you were under the mistaken impression that we could guarantee all
 HD heads were unloaded in userspace, and before halt runs.  We not only
 cannot do it, we also do not *attempt* to do it.  The only thing in Debian
 initscripts that really tries to take care of HD head unloads is the halt
 command.
 
 You can, of course, try to make sure hdparm was run and actually unloaded all
 heads for your particular configuration, but it is not an acceptable
 default, because we cannot get it right every time.  So implement it as an
 admin-enabled, admin-configured option by all means.  But *not* as a
 default.

Perhaps the sleep-then-reboot loop belongs inside the halt command, then. 
At some point, there's going to be little difference between cutting power 
to the PSU, and having the PSU do a soft poweroff.

  need to. What more shutdown magic do you need on a hard disk that is not 
  spinning?
 
 None, if the disk spun down.  But hdparm doesn't work for all disks.  And we
 cannot reliably spin down all disks and unload heads from userspace, for all
 possible configurations.  Thus, anything that relies on this cannot be made
 a default.

  If you're talking about a flaky hardware RAID array where you can't stop 
 
 SCSI plus all software RAID arrays.

I mean _after_ mdadm is stopped. Not that any distinction is currently made 
between RAID setups, flaky or otherwise.

 The bad thing in your patch is that the maintainer made it non-optional, and
 the default.  I understand it will not be a default anymore, which is enough
 for me.

I agree that having it non-optional is undesirable.

  You: Do a normal system shutdown. Rely on server-grade features (e.g. WOL 
 
 No.  Me:  make the whole behaviour you want *optional*, and not the default,
 because it is dangerous and we don't have a lick of a chance of making it
 safe for all setups.

  packet from a networked UPS) to resume operation, or an On/Off state: ON
 
 No.  Rely on the standard PC97 ACPI desktop BIOS option "always power on on
 AC return", which is the correct way to deal with machines that need to
 restart when a UPS powers them up again.

Correct? The PC will then always turn on when the AC returns, e.g. when 
being plugged in, or after a power outage when it was off to begin with. 
The PSU's hard power switch isn't a solution, either, as it is often 
inconvenient/inaccessible and many newer consumer PSUs don't even have one.

The real solution is to have an on/off state bit that can be frobbed by the 
OS, but I'm not holding my breath 

Bug#358696: This is dangerous, please make it default to disabled

2006-07-28 Thread Daniel Richard G.
On Fri, 2006 Jul 28 14:46:47 -0300, Henrique de Moraes Holschuh wrote:
 
 1. The UPS may take more than 15 minutes to shutdown the load.  You cannot
 assume things like this, and you will cause data loss if you get it wrong:
 the power-off could come with the system fully online.

The time period should be configurable; I just suggested 15 minutes as a 
default. You could set a higher value, but the tradeoff is that if the 
power returns, the system is unavailable for that time period.
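
As a rough illustration of the configurable period discussed here, the
sketch below converts a `sleep`-style duration such as `15m` into seconds.
The names `DEFAULT_WAIT` and `parse_wait` are hypothetical and are not
taken from the patch or from NUT; only the 15-minute default comes from
this thread.

```python
import re

DEFAULT_WAIT = "15m"  # the default suggested in this thread

_UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600}


def parse_wait(text: str = DEFAULT_WAIT) -> int:
    """Convert a duration like '15m', '90s' or '2h' to seconds.

    A bare number is treated as seconds, as the sleep command does.
    """
    match = re.fullmatch(r"(\d+)([smh]?)", text.strip())
    if match is None:
        raise ValueError(f"unrecognised duration: {text!r}")
    number, unit = match.groups()
    return int(number) * _UNIT_SECONDS[unit or "s"]
```

Whatever value is chosen is also the longest the system can remain
unavailable after the power has come back, which is the tradeoff described
above.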

 2. Not powering off the box by itself (read: allowing halt and the kernel to
 do its job and cut power cleanly) means it will be subject to high
 transients when the UPS shuts down the load.  This will, in turn, make it
 worse for the other loads that have not been properly shut down.  It would
 be a disaster in a server farm.

Please elaborate on how server equipment is subjected to a transient when a 
UPS cuts power to it. (If anything, the situation is much worse when it is 
powered back on.)

 3. Non-controlled shutdowns are *very* bad for all hardware, including
 desktop systems.  For starters, all disks will be subject to emergency head
 unloads.  The halt utility works around a lot of kernel bugs to make
 sure disks are parked, RAID arrays are in read-only mode or shut down, etc.,
 for a damn good reason.

All of which can be done (and already is, I believe). The only thing that 
the system is doing while waiting for poweroff is sleep 15m; reboot---no 
disks need to be spinning for that.

 4. It is very probable that in any non-home scenario, a UPS will protect
 more than one piece of equipment.  In those scenarios, the UPS is configured
 to NOT accept an immediate load-shutdown command from any of the machines,
 just from the main controller host.  NUT is geared to work fine with, and
 specifically supports, such configurations.  This has to be taken into
 account.

Isn't this already the case for non-networked UPSes? When the interface is 
serial or USB, it can only be connected to (and controlled by) a single, 
master host.

 Thus, implementing the workaround proposed in this bug report as a default
 behaviour is not acceptable.  Please revert the change, or make it optional,
 and *not* enabled by default.   I would go even further and strongly
 discourage the use of this option, as it can damage the hardware.

I think you'll take issue with the NUT documentation, then, as it 
specifically suggests this approach.


--Daniel





Bug#358696: This is dangerous, please make it default to disabled

2006-07-28 Thread Henrique de Moraes Holschuh
On Fri, 28 Jul 2006, Daniel Richard G. wrote:
 The time period should be configurable; I just suggested 15 minutes as a 
 default. You could set a higher value, but the tradeoff is that if the 
 power returns, the system is unavailable for that time period.

There is no tradeoff without the hack, and the hack is only needed in
hardware unsuitable for UPS management.  Thus, it must be optional.  It is
dangerous to data and the hardware, so it should not be the default.

It is fairly simple, really, unless I missed something major (which is
always possible).

  2. Not powering off the box by itself (read: allowing halt and the kernel to
  do its job and cut power cleanly) means it will be subject to high
  transients when the UPS shuts down the load.  This will, in turn, make it
  worse for the other loads that have not been properly shut down.  It would
  be a disaster in a server farm.
 
 Please elaborate on how server equipment is subjected to a transient when a 
 UPS cuts power to it. (If anything, the situation is much worse when it is 
 powered back on.)

You have transient responses to power cuts.  Watch it on an oscilloscope:
computer hardware is not a resistive load.

The situation is bad when everything powers up at the same time too, yes.
That's why it isn't all powered up at once in server rooms, blade
enclosures, etc.

  3. Non-controlled shutdowns are *very* bad for all hardware, including
  desktop systems.  For starters, all disks will be subject to emergency head
  unloads.  The halt utility works around a lot of kernel bugs to make
  sure disks are parked, RAID arrays are in read-only mode or shut down, etc.,
  for a damn good reason.
 
 All of which can be done (and already is, I believe). The only thing that 
 the system is doing while waiting for poweroff is sleep 15m; reboot---no 
 disks need to be spinning for that.

If you did not call halt and tell the kernel to shut down the devices, then
no, it was *not* done.

And the kernel is the *only* thing that really knows how to properly
power down the devices.  Currently, we cannot easily ask it to do so from
userspace, and if we did, we could no longer access the disks, for example.

 Isn't this already the case for non-networked UPSes? When the interface is 
 serial or USB, it can only be connected to (and controlled by) a single, 
 master host.

The issue is how the initscript behaves if the NUT shutdown command doesn't
kill everything to kingdom come in 5 seconds.  In fact, a proper UPS is
going to be programmed to actually *delay* the load power-down command for
enough time to allow the load to try to power down for real by itself.

  Thus, implementing the workaround proposed in this bug report as a default
  behaviour is not acceptable.  Please revert the change, or make it optional,
  and *not* enabled by default.   I would go even further and strongly
  discourage the use of this option, as it can damage the hardware.
 
 I think you'll take issue with the NUT documentation, then, as it 
 specifically suggests this approach.

I will.  But maybe, perchance, the NUT docs don't suggest you do it unless
you own hardware that cannot do it properly?  I didn't read it yet.

-- 
  One disk to rule them all, One disk to find them. One disk to bring
  them all and in the darkness grind them. In the Land of Redmond
  where the shadows lie. -- The Silicon Valley Tarot
  Henrique Holschuh





Bug#358696: This is dangerous, please make it default to disabled

2006-07-28 Thread Daniel Richard G.
On Fri, 2006 Jul 28 16:12:35 -0300, Henrique de Moraes Holschuh wrote:
 
 There is no tradeoff without the hack, and the hack is only needed in
 hardware unsuitable for UPS management.  Thus, it must be optional.  It is
 dangerous to data and the hardware, so it should not be the default.

Define (un)suitable for UPS management. Does this definition include
most people's desktop systems?

 You have transient responses to power cuts.  Watch it on an oscilloscope:
 computer hardware is not a resistive load.

No, but any decent power supply will present a load pretty close to it, 
making such a transient negligible. (I know this to be the case in 
production server-room environments.) If someone's got a rack setup where a 
UPS power cutoff will fry everything, they've got a much bigger problem 
than what we're discussing here.

 The situation is bad when everything powers up at the same time too, yes.
 That's why it isn't all powered up at once in server rooms, blade
 enclosures, etc.

Yes. No problem with wanting staggered shutdown, when you have a large 
number of machines connected, but large numbers of machines connected are 
not exactly a typical scenario.

  All of which can be done (and already is, I believe). The only thing that 
  the system is doing while waiting for poweroff is sleep 15m; reboot---no 
  disks need to be spinning for that.
 
 If you did not call halt, plus told the kernel to shutdown the devices, no,
 it was *not* done.
 
 And the kernel is the *only* thing that really knows how to properly
 powerdown the devices.  Currently, we cannot ask it to do so from userspace
 easily, and if we did, we could not access the disks anymore for example.

We have hdparm -Y. We can't access the disk after that, but we shouldn't 
need to. What more shutdown magic do you need on a hard disk that is not 
spinning?

If you're talking about a flaky hardware RAID array where you can't stop 
the platters without it self-destructing, then fine. I recall that the 
scripts check for RAID, and behave differently in that case.

 The issue is how the initscript behaves if the NUT shutdown command doesn't
 kill everything to kingdom come in 5 seconds.  In fact, a proper UPS is
 going to be programmed to actually *delay* the load power-down command for
 enough time to allow the load to try to power down for real by itself.

Assuming things are as I had in my patch, the idea is to have all machines 
connected to a given UPS configured with a similar wait-until-poweroff- 
else-reboot time (if they don't shutdown straightaway).
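
A minimal sketch of the wait-until-poweroff-else-reboot idea, assuming the
reboot action and the sleep function are passed in so that the logic can be
exercised without touching a real machine. The function name and its
parameters are hypothetical; the actual patch is not shown in this thread.

```python
import time
from typing import Callable


def wait_for_power_cut_else_reboot(
    wait_seconds: float,
    reboot: Callable[[], None],
    sleep: Callable[[float], None] = time.sleep,
) -> None:
    """Idle until the UPS cuts power; reboot if it never does.

    If the UPS removes power during the sleep, this function never returns.
    If the sleep completes, the power evidently stayed on, so the machine
    is brought back up.
    """
    sleep(wait_seconds)
    reboot()
```

Giving every machine on the same UPS a similar `wait_seconds` is what keeps
them from rebooting at very different times when the power cut never comes.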

This approach is admittedly not the best one---ideally you'd have some sort 
of statically-linked death watch daemon that would do the same thing, but 
also monitor the UPS, and broadcast an online signal if the power 
returns. You'd no longer have to configure any wait-until-poweroff time, 
and the aforementioned tradeoff goes away. But this is a wishlist item.
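
The wishlist item above might look roughly like the following polling loop.
This is a sketch under stated assumptions, not an existing daemon: the
callbacks `on_line`, `broadcast_online` and `reboot` are hypothetical
stand-ins for querying the UPS, notifying the other machines, and
restarting the host, and `max_polls` exists only so the loop can be tested.

```python
import time
from typing import Callable, Optional


def death_watch(
    on_line: Callable[[], bool],
    broadcast_online: Callable[[], None],
    reboot: Callable[[], None],
    poll_seconds: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
    max_polls: Optional[int] = None,
) -> bool:
    """Poll the UPS until mains power returns, then signal and reboot.

    Returns True if power came back, False if max_polls ran out first.
    With max_polls=None the loop runs until the UPS cuts the power.
    """
    polls = 0
    while max_polls is None or polls < max_polls:
        if on_line():
            broadcast_online()
            reboot()
            return True
        sleep(poll_seconds)
        polls += 1
    return False
```

Because the loop reacts to the UPS state rather than to a fixed timer, no
wait-until-poweroff time needs to be configured in this variant.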

Anyway, the disagreement comes down to this:

Me: Keep the system minimally running, so that it powers off when the UPS 
cuts the power, so that it will turn on again when the power returns, given 
the default behavior and limitations of PC hardware. Do sensible steps to 
avoid data loss (stop the disks, etc.). Have this be the default, as PC 
users are the common case.

You: Do a normal system shutdown. Rely on server-grade features (e.g. WOL 
packet from a networked UPS) to resume operation, or an On/Off state: ON 
BIOS setting (despite the problems associated with that). Have this be the 
default, as the risk of data loss from fragile storage media trumps that of 
system unavailability after an extended outage.

Mr. Quette will have to decide this, but I don't think you've made a strong 
case for a power-cut being significantly detrimental to data or hardware. 
Yes, there are circumstances where this can happen, but these are 
exceptions to the rule. And in one well-known case (RAID arrays), the 
scripts can easily do something different.

  I think you'll take issue with the NUT documentation, then, as it 
  specifically suggests this approach.
 
 I will.  But maybe, perchance, the NUT docs don't suggest you do it unless
 you own hardware that cannot do it properly?  I didn't read it yet.

I'm getting the impression that hardware that cannot do it properly, as 
you mean it, includes most PCs and non-server machines. Your view carries 
the day if NUT's userbase is not mostly these.


--Daniel

