Re: [PATCH] PPC64: EEH Recovery

2005-01-20 Thread Paul Mackerras
Linas Vepstas writes:

> > 2. I don't see why the device nodes for the PCI subtree being reset
> >    would go away, and thus I don't see the need for your eeh_cfg_tree
> >    struct.
> 
> It's not the reset, it's the hot-plug remove.  The hot plug code assumes
> that you are going to physically remove the device from the slot, so
> it removes the device_node as part of the "unconfig".

OK, I missed that.  It seems a bit bogus to me.  Could you point me at
where in the code this happens?

> > 3. Is there a good reason why we can't use the assigned-addresses
> >    property on the relevant device tree nodes to tell us what to set
> >    the BARs to?
> 
> Yes, the reason is that after a reset, that property doesn't hold any
> decent data.  I discussed this with the firmware developers, and their
> response was that it is the kernel's responsibility to compute
> (or save/restore) such values.  (Except for bridges, which they will do for
> us).

The not holding any decent data is a consequence of the device nodes
getting thrown away, isn't it?  I fail to see how resetting the device
can of itself affect our copy of the device tree.

> > In particular I think it should be a
> >    userland write to a sysfs file that kicks off the restart process
> >    rather than it just happening after 5 seconds.  Anyway, what
> >    process or thread is executing that 5 second sleep?  Is it keventd
> >    or something?
> 
> It's a workqueue.

Which get run in keventd's context.  In other words no other
workqueues will get run during the 5 second sleep, or at least not on
that cpu.
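
To make the concern concrete, here is a minimal Python model (not kernel code) of one worker draining a queue strictly in order, standing in for keventd on a single CPU. The item names and durations are made up for illustration; only the 5 second figure comes from the discussion above.

```python
def start_times(work_items):
    """Return {name: start time in seconds} for items run strictly in
    order by one worker, as a single per-CPU worker thread would."""
    clock = 0.0
    starts = {}
    for name, duration in work_items:
        starts[name] = clock
        clock += duration  # the worker is busy (or asleep) for this long
    return starts

# Hypothetical queue: the EEH handler sleeps 5 s before restarting the slot.
queue = [("eeh_recovery", 5.0), ("unrelated_work", 0.01)]
delays = start_times(queue)
print(delays["unrelated_work"])  # seconds the unrelated item had to wait
```

In this model anything queued behind the sleeping item cannot start until the sleep finishes, which is the argument for moving the wait out of the shared worker.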

Paul.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] PPC64: EEH Recovery

2005-01-20 Thread Linas Vepstas
On Wed, Jan 19, 2005 at 05:06:05PM +1100, Paul Mackerras was heard to remark:
> Linas Vepstas writes:
> 
> > p.s.  It was not clear to me if the EEH patch previously sent 
> > (6 January 2005, same subject line) will be wending its way into 
> > the main Torvalds kernel tree, or not.  I hadn't really gotten
> > confirmation one way or another.
> 
> I'm not really totally happy with it yet, on a number of fronts:

[...]

I forgot to mention: while I agree with some/many of these points,
especially with regards to recovery, I'd also like to note that the 
patch was mailed in two independent parts:  

-- a number of generic infrastructure routines, all in a ppc64 patch, and
-- the code that actually performs the recovery, as a patch to 
   the drivers/pci/hotplug subsystem.

While the actual recovery code is controversial (e.g. no support of 
SCSI recovery), I'd like to at least get in the generic 
infrastructure pieces.  

--linas


Re: [PATCH] PPC64: EEH Recovery

2005-01-20 Thread Linas Vepstas

On Wed, Jan 19, 2005 at 05:06:05PM +1100, Paul Mackerras was heard to remark:
> Linas Vepstas writes:
> 
> > p.s.  It was not clear to me if the EEH patch previously sent 
> > (6 January 2005, same subject line) will be wending its way into 
> > the main Torvalds kernel tree, or not.  I hadn't really gotten
> > confirmation one way or another.
> 
> I'm not really totally happy with it yet, on a number of fronts:
> 
> 1. You're adding more PCI-specific stuff to the device_node struct,
>    which I don't like.  I would prefer that the device_node tree
>    contains basically just what we get from OF, and that we have a
>    separate struct for storing ppc64-specific information for each PCI
>    device.  Fixing that is outside the scope of your patch, though.

I wrote this down on my to-do list.  It's the sort of thing that 
evaporates from my consciousness when other things come along,
but I'll give it a shot.  

> 2. I don't see why the device nodes for the PCI subtree being reset
>    would go away, and thus I don't see the need for your eeh_cfg_tree
>    struct.

It's not the reset, it's the hot-plug remove.  The hot plug code assumes
that you are going to physically remove the device from the slot, so
it removes the device_node as part of the "unconfig".

Of course, I found this out only after performing a null-pointer deref.
Not only does the node go away, but all of the various pointers it holds
are zeroed in the process.

The cfg tree holds on to those pointers, so that I wouldn't have to
muck with the device_node removal code to do something tricky.

> 3. Is there a good reason why we can't use the assigned-addresses
>    property on the relevant device tree nodes to tell us what to set
>    the BARs to?

Yes, the reason is that after a reset, that property doesn't hold any
decent data.  I discussed this with the firmware developers, and their
response was that it is the kernel's responsibility to compute
(or save/restore) such values.  (Except for bridges, which they will do for us).
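
As an illustration of the save/restore option, here is a rough Python sketch that works on a simulated config-space image rather than real hardware. It assumes a type 0 (non-bridge) header, where the six 32-bit BARs sit at offsets 0x10 through 0x24, little-endian; the function names and the sample BAR value are invented for the example.

```python
import struct

BAR_OFFSETS = range(0x10, 0x28, 4)  # six 32-bit BARs in a type 0 header

def save_bars(config_space):
    """Copy the six BAR dwords out of a config-space image."""
    return [struct.unpack_from("<I", config_space, off)[0]
            for off in BAR_OFFSETS]

def restore_bars(config_space, saved):
    """Write previously saved BAR values back into the image."""
    for off, value in zip(BAR_OFFSETS, saved):
        struct.pack_into("<I", config_space, off, value)

# Simulated 256-byte config space; the BAR value is an arbitrary example.
cfg = bytearray(256)
struct.pack_into("<I", cfg, 0x10, 0xC0000000)
saved = save_bars(cfg)
cfg[0x10:0x28] = bytes(0x18)        # model a reset wiping the BARs
restore_bars(cfg, saved)
```

Bridges are left out on purpose, since the firmware is said to restore those itself.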

> 4. I think the 5 second sleep is quite bogus, and shows that we have
>    the flow of control wrong.

:)  Yes, well, indeed it is.  Don't look at me, not my idea.

> In particular I think it should be a
>    userland write to a sysfs file that kicks off the restart process
>    rather than it just happening after 5 seconds.  Anyway, what
>    process or thread is executing that 5 second sleep?  Is it keventd
>    or something?

It's a workqueue.

> 5. AFAICS userland will get an unplug notification for the device, but
>    nothing to indicate that is due to an EEH slot isolation event.  I
>    think userland should be told about EEH events.

In principle, I'd agree. In practice, this would seem to require changes
or additions or enhancements to udev that I don't quite understand, as
well as potential changes to udev scripts.  Maybe I don't understand
sysfs sufficiently well.  I am very tempted to punt on this, and wait 
for the Intel-backed PCI-E code to get to this point, and then do whatever 
they're doing.

--linas


Re: [PATCH] PPC64: EEH Recovery

2005-01-19 Thread Nathan Fontenot
Paul Mackerras wrote:
> 5. AFAICS userland will get an unplug notification for the device, but
>    nothing to indicate that is due to an EEH slot isolation event.  I
>    think userland should be told about EEH events.

Currently there is a way for userland to determine if a hotplug event 
they receive is due to an EEH slot isolation event.  It's not very 
pretty and requires the rtas_errd daemon to be running.

The RTAS event generated from the EEH event is logged to 
/var/log/platform by rtas_errd.  Userland scripts would have to search 
the file for a recent EEH event matching their device to make this 
determination.  This isn't as nice as a direct notification but is what 
we have at this point.
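
A userland script along these lines might look like the sketch below. The format of /var/log/platform is not described in this thread, so the function only looks for two caller-supplied substrings (an event marker and a device identifier); both strings and the sample lines are assumptions for illustration, not real rtas_errd output.

```python
def find_eeh_event(lines, device_id, marker="EEH"):
    """Return the last line mentioning both the marker and the device,
    or None.  Both strings are assumptions supplied by the caller."""
    hit = None
    for line in lines:
        if marker in line and device_id in line:
            hit = line.rstrip("\n")
    return hit

# Made-up sample lines, NOT real rtas_errd output.
sample = [
    "event 1: EEH slot isolation, location U0.1-P1-I3\n",
    "event 2: fan warning\n",
]
print(find_eeh_event(sample, "U0.1-P1-I3"))
```

Against the real file the same function could be fed an open file object, e.g. `with open("/var/log/platform") as f: find_eeh_event(f, my_device)`. The sketch does not check how recent the event is; that would need the log's timestamp format.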

--
Nathan Fontenot


Re: [PATCH] PPC64: EEH Recovery

2005-01-18 Thread Paul Mackerras
Linas Vepstas writes:

> p.s.  It was not clear to me if the EEH patch previously sent 
> (6 January 2005, same subject line) will be wending its way into 
> the main Torvalds kernel tree, or not.  I hadn't really gotten
> confirmation one way or another.

I'm not really totally happy with it yet, on a number of fronts:

1. You're adding more PCI-specific stuff to the device_node struct,
   which I don't like.  I would prefer that the device_node tree
   contains basically just what we get from OF, and that we have a
   separate struct for storing ppc64-specific information for each PCI
   device.  Fixing that is outside the scope of your patch, though.

2. I don't see why the device nodes for the PCI subtree being reset
   would go away, and thus I don't see the need for your eeh_cfg_tree
   struct.

3. Is there a good reason why we can't use the assigned-addresses
   property on the relevant device tree nodes to tell us what to set
   the BARs to?

4. I think the 5 second sleep is quite bogus, and shows that we have
   the flow of control wrong.  In particular I think it should be a
   userland write to a sysfs file that kicks off the restart process
   rather than it just happening after 5 seconds.  Anyway, what
   process or thread is executing that 5 second sleep?  Is it keventd
   or something?

5. AFAICS userland will get an unplug notification for the device, but
   nothing to indicate that is due to an EEH slot isolation event.  I
   think userland should be told about EEH events.

Regards,
Paul.



Re: [PATCH] PPC64: EEH Recovery

2005-01-17 Thread Linas Vepstas

Andrew,

The attached file describes PCI bus EEH "Extended Error Handling"
concepts and operation;  could you drop this into the kernel
documentation tree, at
linux-2.6/Documentation/powerpc/eeh-pci-error-recovery.txt ?

Signed-off-by: Linas Vepstas <[EMAIL PROTECTED]>

--linas

p.s.  It was not clear to me if the EEH patch previously sent 
(6 January 2005, same subject line) will be wending its way into 
the main Torvalds kernel tree, or not.  I hadn't really gotten
confirmation one way or another.




  PCI Bus EEH Error Recovery
  --------------------------
   Linas Vepstas
   <[EMAIL PROTECTED]>
  12 January 2005


Overview:
---------
The IBM POWER-based pSeries and iSeries computers include PCI bus 
controller chips that have extended capabilities for detecting and 
reporting a large variety of PCI bus error conditions.  These features 
go under the name of "EEH", for "Extended Error Handling".  The EEH
hardware features allow PCI bus errors to be cleared and a PCI
card to be "rebooted", without also having to reboot the operating
system.  

This is in contrast to traditional PCI error handling, where the 
PCI chip is wired directly to the CPU, and an error would cause 
a CPU machine-check/check-stop condition, halting the CPU entirely. 
Another "traditional" technique is to ignore such errors, which
can lead to data corruption (of either user data or kernel data),
hung/unresponsive adapters, or system crashes/lockups.  Thus, 
the idea behind EEH is that the operating system can become more
reliable and robust by protecting it from PCI errors, and giving
the OS the ability to "reboot"/recover individual PCI devices.

Future systems from other vendors, based on the PCI-E specification,
may contain similar features. 


Causes of EEH Errors
--------------------
EEH was originally designed to guard against hardware failure, such 
as PCI cards dying from heat, humidity, dust, vibration and bad 
electrical connections. The vast majority of EEH errors seen in 
"real life" are due to either poorly seated PCI cards or, 
unfortunately quite commonly, device driver bugs, device firmware 
bugs, and sometimes PCI card hardware bugs.

The most common software bug is one that causes the device to
attempt to DMA to a location in system memory that has not been 
reserved for DMA access for that card.  Detecting this is a powerful 
feature, as it prevents what would otherwise have been silent memory 
corruption caused by the bad DMA.  A number of device driver
bugs have been found and fixed in this way over the past few 
years.  Other possible causes of EEH errors include data or 
address line parity errors (for example, due to poor electrical 
connectivity due to a poorly seated card), and PCI-X split-completion 
errors (due to software, device firmware, or device PCI hardware bugs). 
The vast majority of "true hardware failures" can be cured by
physically removing and re-seating the PCI card.


Detection and Recovery
----------------------
In the following discussion, a generic overview of how to detect 
and recover from EEH errors will be presented. This is followed
by an overview of how the current implementation in the Linux
kernel does it.  The actual implementation is subject to change,
and some of the finer points are still being debated.  These 
may in turn be swayed if or when other architectures implement 
similar functionality.

When a PCI Host Bridge (PHB, the bus controller connecting the 
PCI bus to the system CPU electronics complex) detects a PCI error
condition, it will "isolate" the affected PCI card.  Isolation 
will block all writes (either to the card from the system, or 
from the card to the system), and it will cause all reads to 
return all-ff's (0xff, 0xffff, 0xffffffff for 8/16/32-bit reads).
This value was chosen because it is the same value you would
get if the device was physically unplugged from the slot.
This includes access to PCI memory, I/O space, and PCI config 
space.  Interrupts, however, will continue to be delivered.
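
As a small illustration of the all-ff's pattern, the Python sketch below computes the all-ones value for a given read width and tests a read result against it. The function names are invented for this example, and a match is only a hint that the slot may be isolated, since a healthy register can legitimately hold all ones.

```python
def all_ones(width_bits):
    """The value an isolated slot returns for a read of this width."""
    return (1 << width_bits) - 1

def looks_isolated(value, width_bits):
    """True if a read came back as all-ff's for its width.
    This is a reason to ask the firmware, not proof of isolation."""
    return value == all_ones(width_bits)

for width in (8, 16, 32):
    print(hex(all_ones(width)))
```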

Detection and recovery are performed with the aid of ppc64 
firmware.  The programming interfaces in the Linux kernel 
into the firmware are referred to as RTAS (Run-Time Abstraction 
Services).  The Linux kernel does not (should not) access
the EEH function in the PCI chipsets directly, primarily because 
there are a number of different chipsets out there, each with 
different interfaces and quirks. The firmware provides a 
uniform abstraction layer that will work with all pSeries 
and iSeries hardware (and be forwards-compatible).

If the OS or device driver suspects that a PCI slot has been 
EEH-isolated, there is a firmware call it can make to determine if 
this is the case. If so, then the device driver should put itself 
into a consistent state (given that it won't be able to complete any 
pending work) and start recovery of the card.  Recovery normally
would consist of