On Mon, 2017-07-24 at 20:30 +0200, Borislav Petkov wrote:
:
>
> So I don't want to break existing users and thus make only explicitly
> known platforms load ghes_edac. In the current case, the HPE
> machines. All the rest will simply use the platform drivers and
> nothing will change for them.
>
(Sending to your other mail address because there's some temporary resolution
issue:
msmtp: recipient address mche...@s-opensource.com not accepted by the server
msmtp: server message: 451 4.3.0 : Temporary lookup
failure
msmtp: could not send mail (account alien8.de from /home/boris/.msmtprc)
On Mon, Jul 24, 2017 at 05:54:52PM +, Kani, Toshimitsu wrote:
> Umm... I was under the impression that we are adding the OSC bit check in
> addition to the current GHES filtering.
Read the parallel subthread again.
--
Regards/Gruss,
Boris.
ECO tip #101: Trim your mails when you reply.
--
On Mon, 2017-07-24 at 14:56 -0300, Mauro Carvalho Chehab wrote:
> Em Mon, 24 Jul 2017 15:56:27 +
:
> That's probably too late for me as I received a new HP machine
> we bought just last week, but the next time I need to
> get new hardware, what would be the non-RAS equivalent to
>
Em Mon, 24 Jul 2017 18:44:00 +0200
Borislav Petkov escreveu:
> On Mon, Jul 24, 2017 at 01:04:13PM -0300, Mauro Carvalho Chehab wrote:
> > If the kernel forces those users to use ghes_edac by default,
> > they won't see the error counts anymore, but, instead,
> > hardware reports that the memo
Em Mon, 24 Jul 2017 15:56:27 +
"Kani, Toshimitsu" escreveu:
> On Mon, 2017-07-24 at 17:37 +0200, Borislav Petkov wrote:
> > On Mon, Jul 24, 2017 at 03:25:34PM +, Kani, Toshimitsu wrote:
> :
> >
> > > We've been providing this model for many years now.
> >
> > Dude, relax, I'm onl
On Mon, 2017-07-24 at 20:50 +0300, Boris Petkov wrote:
> On July 24, 2017 8:44:03 PM GMT+03:00, "Kani, Toshimitsu" wrote:
> > I assumed our platforms w/o built-in RAS do not implement GHES,
>
> If we make it a normal module, it will be decoupled from GHES and it
> will rely only on the
On July 24, 2017 8:44:03 PM GMT+03:00, "Kani, Toshimitsu"
wrote:
>I assumed our platforms w/o built-in RAS do not implement GHES,
If we make it a normal module, it will be decoupled from GHES and it will rely
only on the whitelist to load.
--
Sent from a small device: formatting sux and brevi
On Mon, 2017-07-24 at 18:37 +0200, Borislav Petkov wrote:
> On Mon, Jul 24, 2017 at 03:56:27PM +, Kani, Toshimitsu wrote:
> > Yes, Mauro has already pointed this out. As I replied to him, we
> > do have a separate series of platforms that do not have built-in
> > RAS, and
>
> So this whitelis
On Mon, Jul 24, 2017 at 01:04:13PM -0300, Mauro Carvalho Chehab wrote:
> If the kernel forces those users to use ghes_edac by default,
> they won't see the error counts anymore, but, instead,
> hardware reports that the memories need to be replaced.
This is exactly why I'm trying to load ghes_
On Mon, Jul 24, 2017 at 03:56:27PM +, Kani, Toshimitsu wrote:
> Yes, Mauro has already pointed this out. As I replied to him, we do
> have a separate series of platforms that do not have built-in RAS, and
So this whitelist entry
+static struct acpi_oemlist oemlist[] = {
+ {"HPE ", "S
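The whitelist idea in this message can be illustrated with a small user-space sketch. It is not the patch from the thread, which is truncated here: the struct `oem_entry`, the table `oem_whitelist`, and the function `platform_whitelisted` are hypothetical names, and the prefix comparison is an assumption about how padded ACPI OEM strings might be matched. The "HPE" / "Server" pair is taken from the discussion elsewhere in the thread.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a whitelist entry; field names are assumptions. */
struct oem_entry {
    const char *oem_id;        /* expected prefix of the ACPI OEM ID */
    const char *oem_table_id;  /* expected prefix of the ACPI OEM table ID */
};

/* Illustrative table only; the values follow the "HPE" / "Server" pair
 * mentioned in the thread, not the truncated patch. */
static const struct oem_entry oem_whitelist[] = {
    { "HPE", "Server" },
};

/* Return true if both IDs start with the strings of some whitelist entry. */
static bool platform_whitelisted(const char *oem_id, const char *oem_table_id)
{
    size_t i;

    for (i = 0; i < sizeof(oem_whitelist) / sizeof(oem_whitelist[0]); i++) {
        const struct oem_entry *e = &oem_whitelist[i];

        if (strncmp(oem_id, e->oem_id, strlen(e->oem_id)) == 0 &&
            strncmp(oem_table_id, e->oem_table_id,
                    strlen(e->oem_table_id)) == 0)
            return true;
    }
    return false;
}
```

With a check of this kind, only platforms listed in the table would pick ghes_edac, and every other machine would keep using its platform driver, which is the behavior described in the thread.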
Em Mon, 24 Jul 2017 17:37:16 +0200
Borislav Petkov escreveu:
> > Customers do not see error counts. I do not think it's bogus.
> > I am just trying to enable OS error reporting with ghes_edac.
>
> I know, you don't have to state the obvious constantly.
The problem I see is that, currently,
On Mon, 2017-07-24 at 17:37 +0200, Borislav Petkov wrote:
> On Mon, Jul 24, 2017 at 03:25:34PM +, Kani, Toshimitsu wrote:
:
>
> > We've been providing this model for many years now.
>
> Dude, relax, I'm only trying to point out to you that there are
> customers who want to see *every* error
On Mon, Jul 24, 2017 at 03:25:34PM +, Kani, Toshimitsu wrote:
> Customers do not see error counts. I do not think it's bogus.
Not showing the real error counts but something contrived is the
definition of bogus numbers. But you're not showing anything - only when
some thresholds are bei
On Mon, 2017-07-24 at 17:04 +0200, Borislav Petkov wrote:
> On Mon, Jul 24, 2017 at 02:49:30PM +, Kani, Toshimitsu wrote:
> > We do not tell the error counts to customers.
>
> Please read what I said: do you tell your customers that the error
> counts they're seeing (or are *not* seeing) are bo
On Mon, Jul 24, 2017 at 02:49:30PM +, Kani, Toshimitsu wrote:
> We do not tell the error counts to customers.
Please read what I said: do you tell your customers that the error
counts they're seeing (or are *not* seeing) are bogus because the BIOS is
hiding them? Not the *actual* numbers!
> We
On Sat, 2017-07-22 at 08:28 +0200, Borislav Petkov wrote:
> On Fri, Jul 21, 2017 at 06:38:52PM +, Kani, Toshimitsu wrote:
> > Enterprise platforms have a very different model (I do not say it's
> > better for everyone from the cost perspective). Typically, such
>
> But you do tell your customer
On Fri, Jul 21, 2017 at 06:38:52PM +, Kani, Toshimitsu wrote:
> Enterprise platforms have a very different model (I do not say it's
> better for everyone from the cost perspective). Typically, such
But you do tell your customers that the error counts they see are not
really what *actually* happ
On Fri, 2017-07-21 at 19:23 +0200, Borislav Petkov wrote:
:
> Not only that: thresholds depend on the DIMM types which means, BIOS
> must know what DIMM types are in there which I doubt.
BIOS knows DIMM model from the SPD data.
> So exposing that to configuration instead of "deciding" for peopl
On Fri, Jul 21, 2017 at 02:01:31PM -0300, Mauro Carvalho Chehab wrote:
> I see the value of having a threshold in BIOS, provided that it is
> well documented, and whose value can be adjusted, if needed.
>
> One of the things I wanted to implement in ras-daemon was an
> algorithm that would be doi
On Fri, 2017-07-21 at 14:01 -0300, Mauro Carvalho Chehab wrote:
> Em Fri, 21 Jul 2017 16:40:20 +
> "Kani, Toshimitsu" escreveu:
>
> > On Fri, 2017-07-21 at 12:44 -0300, Mauro Carvalho Chehab wrote:
> > > Em Fri, 21 Jul 2017 15:34:50 +
> > > "Kani, Toshimitsu" escreveu:
> > >
> > > > O
Em Fri, 21 Jul 2017 16:40:20 +
"Kani, Toshimitsu" escreveu:
> On Fri, 2017-07-21 at 12:44 -0300, Mauro Carvalho Chehab wrote:
> > Em Fri, 21 Jul 2017 15:34:50 +
> > "Kani, Toshimitsu" escreveu:
> >
> > > On Fri, 2017-07-21 at 17:13 +0200, Borislav Petkov wrote:
> > > > On Fri, Jul 2
On Fri, 2017-07-21 at 12:44 -0300, Mauro Carvalho Chehab wrote:
> Em Fri, 21 Jul 2017 15:34:50 +
> "Kani, Toshimitsu" escreveu:
>
> > On Fri, 2017-07-21 at 17:13 +0200, Borislav Petkov wrote:
> > > On Fri, Jul 21, 2017 at 03:08:41PM +, Kani, Toshimitsu wrote:
> > > > Yes, that is
On Fri, 2017-07-21 at 17:53 +0200, Borislav Petkov wrote:
> On Fri, Jul 21, 2017 at 03:34:50PM +, Kani, Toshimitsu wrote:
> > I suppose it'd depend on vendors, but I do not think users can do
> > it properly unless they have in-depth knowledge about the hardware.
>
> I'm talking about a menu in t
On Fri, Jul 21, 2017 at 03:34:50PM +, Kani, Toshimitsu wrote:
> I suppose it'd depend on vendors, but I do not think users can do it
> properly unless they have in-depth knowledge about the hardware.
I'm talking about a menu in the BIOS where you can set the thresholding
levels on the system. Doe
Em Fri, 21 Jul 2017 15:34:50 +
"Kani, Toshimitsu" escreveu:
> On Fri, 2017-07-21 at 17:13 +0200, Borislav Petkov wrote:
> > On Fri, Jul 21, 2017 at 03:08:41PM +, Kani, Toshimitsu wrote:
> > > Yes, that is correct. Corrected errors are reported to the OS when
> > > they exceed the pla
On Fri, 2017-07-21 at 17:13 +0200, Borislav Petkov wrote:
> On Fri, Jul 21, 2017 at 03:08:41PM +, Kani, Toshimitsu wrote:
> > Yes, that is correct. Corrected errors are reported to the OS when
> > they exceed the platform's threshold.
>
> Are those thresholds user-configurable?
I suppose i
On Fri, Jul 21, 2017 at 03:08:41PM +, Kani, Toshimitsu wrote:
> Yes, that is correct. Corrected errors are reported to the OS when
> they exceed the platform's threshold.
Are those thresholds user-configurable?
If not, what are you telling users who want to see *every* corrected
error for
On Fri, 2017-07-21 at 15:47 +0200, Borislav Petkov wrote:
> On Fri, Jul 21, 2017 at 10:40:01AM -0300, Mauro Carvalho Chehab wrote:
> > What happens when the error can be corrected? Does it still report
> > it to userspace, or just silently hide the error?
> >
> > If I remember well about a past
On Fri, Jul 21, 2017 at 10:40:01AM -0300, Mauro Carvalho Chehab wrote:
> What happens when the error can be corrected? Does it still report it to
> userspace, or just silently hide the error?
>
> If I remember well about a past discussion with some vendor, I was told
> that the firmware can hide s
Em Fri, 21 Jul 2017 15:34:41 +0200
Borislav Petkov escreveu:
> On Thu, Jul 20, 2017 at 07:50:03PM +, Kani, Toshimitsu wrote:
> > GHES / firmware-first still requires OS recovery actions when an error
> > cannot be corrected by the platform. They are handled by ghes_proc(),
> > and ghes_edac
On Thu, Jul 20, 2017 at 07:50:03PM +, Kani, Toshimitsu wrote:
> GHES / firmware-first still requires OS recovery actions when an error
> cannot be corrected by the platform. They are handled by ghes_proc(),
> and ghes_edac remains its error-reporting wrapper.
I mean all the recovery actions t
On Thu, 2017-07-20 at 17:15 -0300, Mauro Carvalho Chehab wrote:
> Em Thu, 20 Jul 2017 19:50:03 +
> "Kani, Toshimitsu" escreveu:
:
> > Firmware has better knowledge about the platform and can provide
> > better RAS when implemented properly. I agree that user
> > experiences may vary on platf
Em Thu, 20 Jul 2017 19:50:03 +
"Kani, Toshimitsu" escreveu:
> On Thu, 2017-07-20 at 06:33 +0200, Borislav Petkov wrote:
> > On Wed, Jul 19, 2017 at 04:40:25PM +, Kani, Toshimitsu wrote:
> > > ghes_edac allows reporting errors to OS management tools like
> > > rasdaemon in addition to p
On Thu, 2017-07-20 at 06:33 +0200, Borislav Petkov wrote:
> On Wed, Jul 19, 2017 at 04:40:25PM +, Kani, Toshimitsu wrote:
> > ghes_edac allows reporting errors to OS management tools like
> > rasdaemon in addition to platform-specific management.
>
> So ghes_edac *is* a poor man's driver in
Em Thu, 20 Jul 2017 19:05:04 +0200
Borislav Petkov escreveu:
> On Thu, Jul 20, 2017 at 04:55:59PM +, Luck, Tony wrote:
> > Add a module parameter to those edac drivers that can override the check
> > and let them load anyway. I'm not paranoid, I just assume that there is a BIOS
> > out
> Or add that parameter to edac_core.ko and let it control which EDAC
> driver gets loaded? Something like
>
> edac=ignore_ghes
>
> or so. And then the other EDAC drivers query it.
Sure ... one central place is better than adding code to each
driver.
-Tony
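The "one central place" idea agreed on above could look roughly like the following user-space sketch. It is only an illustration under assumptions: `edac_ignore_ghes`, `edac_parse_option`, and `platform_driver_may_load` are hypothetical names, and a real kernel implementation would use the kernel's own parameter handling rather than a plain string compare.

```c
#include <stdbool.h>
#include <string.h>

/* Hypothetical central switch, set from an "edac=" style option value. */
static bool edac_ignore_ghes;

/*
 * Parse the option value. Only "ignore_ghes" is recognised in this sketch.
 * Returns 0 on success and -1 for a missing or unknown value.
 */
static int edac_parse_option(const char *val)
{
    if (val && strcmp(val, "ignore_ghes") == 0) {
        edac_ignore_ghes = true;
        return 0;
    }
    return -1;
}

/*
 * Query for a platform EDAC driver: it may load when firmware does not
 * provide GHES, or when the user asked to ignore GHES.
 */
static bool platform_driver_may_load(bool ghes_present)
{
    return !ghes_present || edac_ignore_ghes;
}
```

Keeping the flag in one place means each platform driver only needs a single query call instead of its own override parameter.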
On Thu, Jul 20, 2017 at 04:55:59PM +, Luck, Tony wrote:
> Add a module parameter to those edac drivers that can override the check
> and let them load anyway. I'm not paranoid, I just assume that there is a BIOS
> out there that sets the OSC/WHEA bits, but isn't generating useful GHES logs.
>> Yes, the following message is shown on HP systems. Please note that
>> WHEA is a Windows-defined interface.
>
> Ok, so let's couple ghes_edac loading to that and see how far we could
> go. I guess we should add checks for that to the major x86 EDAC drivers
> to not load and this way ghes_edac w
On Thu, Jul 20, 2017 at 02:42:25PM +, Kani, Toshimitsu wrote:
> Yes, the following message is shown on HP systems. Please note that
> WHEA is a Windows-defined interface.
Ok, so let's couple ghes_edac loading to that and see how far we could
go. I guess we should add checks for that to the ma
On Thu, 2017-07-20 at 06:16 +0200, Borislav Petkov wrote:
> On Wed, Jul 19, 2017 at 04:56:17PM +, Kani, Toshimitsu wrote:
> > Since ghes_edac has not been used for a long time, I have a feeling
> > that not so many vendors want to use it. In the case of HPE, we do
> > not need to update with e
On Wed, Jul 19, 2017 at 04:40:25PM +, Kani, Toshimitsu wrote:
> ghes_edac allows reporting errors to OS management tools like
> rasdaemon in addition to platform-specific management.
So ghes_edac *is* a poor man's driver in the sense that it doesn't do
anything fancy but repeat like a parro
On Wed, Jul 19, 2017 at 02:55:08PM -0400, Aristeu Rozanski wrote:
> That would also need to keep an eye on versions. A newer version of BIOS
> on a whitelisted platform might be broken.
Yeah, that would be a nasty, back-stabbing SNAFU.
So I'm thinking of adding a bunch of FW_ERR sanity checks to
On Wed, Jul 19, 2017 at 04:56:17PM +, Kani, Toshimitsu wrote:
> Since ghes_edac has not been used for a long time, I have a feeling
> that not so many vendors want to use it. In the case of HPE, we do not
> need to update with each platform since "HPE" "Server" will cover all
> platforms we ne
On Wed, 2017-07-19 at 14:55 -0400, Aristeu Rozanski wrote:
> On Wed, Jul 19, 2017 at 06:22:04PM +0200, Borislav Petkov wrote:
> > On Wed, Jul 19, 2017 at 04:10:07PM +, Kani, Toshimitsu wrote:
> > > I do prefer to avoid any white / black listing. But I do not see how
> > > it solves the b
On Wed, Jul 19, 2017 at 06:22:04PM +0200, Borislav Petkov wrote:
> On Wed, Jul 19, 2017 at 04:10:07PM +, Kani, Toshimitsu wrote:
> > I do prefer to avoid any white / black listing. But I do not see how
> > it solves the buggy DMI/SMBIOS info as an example of firmware bugs we
> > may have to de
>> Later, when GHES gives you a NODE/CARD/MODULE in an error record, you need
>> to match these up. But SMBIOS only gave you two strings "Locator" and "Bank
>> Locator" which have no defined syntax. You are at the mercy of the BIOS writer
>> to put in something parseable.
>
> Well, at some poi
On Wed, 2017-07-19 at 18:22 +0200, Borislav Petkov wrote:
> On Wed, Jul 19, 2017 at 04:10:07PM +, Kani, Toshimitsu wrote:
> > I do prefer to avoid any white / black listing. But I do not see
> > how it solves the buggy DMI/SMBIOS info as an example of firmware
> > bugs we may have to deal with
On Tue, 2017-07-18 at 18:15 -0300, Mauro Carvalho Chehab wrote:
> Em Tue, 18 Jul 2017 19:58:54 +
:
> We had a similar discussion several years ago when I wrote this
> driver. At that time, I talked with Red Hat, HP, Dell, Intel people
> and with some customers with large clusters.
>
> The way
On Wed, Jul 19, 2017 at 04:10:07PM +, Kani, Toshimitsu wrote:
> I do prefer to avoid any white / black listing. But I do not see how
> it solves the buggy DMI/SMBIOS info as an example of firmware bugs we
> may have to deal with.
So how do you want to deal with this?
Maintain an evergrowing
On Wed, 2017-07-19 at 07:52 +0200, Borislav Petkov wrote:
> On Tue, Jul 18, 2017 at 09:20:44PM +, Kani, Toshimitsu wrote:
> > I agree that 'osc_sb_apei_support_acked' should be checked when
> > enabling ghes_edac. I do not know the details of existing issues,
> > but it sounds unlikely that th
On Wed, Jul 19, 2017 at 03:14:32PM +, Luck, Tony wrote:
> Later, when GHES gives you a NODE/CARD/MODULE in an error record, you need
> to match these up. But SMBIOS only gave you two strings "Locator" and "Bank
> Locator" which have no defined syntax. You are at the mercy of the BIOS writer
>
> "The module number of the memory error location. (NODE, CARD, and MODULE
> should provide the information necessary to identify the failing FRU)."
>
> So this tuple is sufficient to pinpoint the DIMM, IIUC.
>
> Which means, ghes_edac can have a single layer of DIMMs without channels.
The tricky
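The "single layer of DIMMs" idea quoted above can be sketched as a flat table keyed by the (NODE, CARD, MODULE) tuple. This is a hypothetical user-space illustration, not ghes_edac code: `struct dimm_loc` and `find_dimm` are assumed names, and how the table would be filled from SMBIOS is left out.

```c
#include <stddef.h>

/* Hypothetical location key for one DIMM in a flat, single-layer table. */
struct dimm_loc {
    unsigned int node;
    unsigned int card;
    unsigned int module;
};

/*
 * Return the index of the DIMM whose tuple matches the error record,
 * or -1 if no entry matches.
 */
static int find_dimm(const struct dimm_loc *dimms, size_t count,
                     unsigned int node, unsigned int card,
                     unsigned int module)
{
    size_t i;

    for (i = 0; i < count; i++) {
        if (dimms[i].node == node && dimms[i].card == card &&
            dimms[i].module == module)
            return (int)i;
    }
    return -1;
}
```

A lookup that returns -1 corresponds to the case where the firmware-supplied tuple cannot be matched to any known DIMM.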
On Tue, Jul 18, 2017 at 10:13:42PM +, Luck, Tony wrote:
> Historically we've had complaints that sb_edac won't load that have been
> tracked to BIOS hiding one of the (many) PCI devices that it needs. But
> device hiding is orthogonal to providing GHES error records. A BIOS might
> do that, b
On Tue, Jul 18, 2017 at 06:15:45PM -0300, Mauro Carvalho Chehab wrote:
> The way it is, ghes_edac is a poor man's driver. What it hopefully
> provides is detection that an error happened, without really telling
> the user what component should be replaced.
I beg to differ. From the UEFI spec:
"T
On Tue, Jul 18, 2017 at 07:58:54PM +, Kani, Toshimitsu wrote:
> I have HPE Haswell and Skylake test systems with GHES, but they do not
> hide IMCs from the OS. So, the sb_edac and skx_edac drivers get
> attached on these systems when ghes_edac is disabled.
That's how it is supposed to work. T
On Tue, Jul 18, 2017 at 09:20:44PM +, Kani, Toshimitsu wrote:
> I agree that 'osc_sb_apei_support_acked' should be checked when
> enabling ghes_edac. I do not know the details of existing issues, but
> it sounds unlikely that this will address all of them since bugs can be
> everywhere.
No, s
> The question is: does the platform do this disabling now?
>
> Tony, I'm looking at sb_edac and there we don't do something like that
> or maybe I'm missing it.
Historically we've had complaints that sb_edac won't load that have been
tracked to BIOS hiding one of the (many) PCI devices that it ne
On Tue, 2017-07-18 at 10:08 +0200, Borislav Petkov wrote:
> On Tue, Jul 18, 2017 at 08:00:07AM +0200, Borislav Petkov wrote:
> > And I think we should try this first: have the firmware disable
> > detection methods so that the platform drivers don't load.
>
> Btw, in looking at this more, what abo
Em Tue, 18 Jul 2017 19:58:54 +
"Kani, Toshimitsu" escreveu:
> On Tue, 2017-07-18 at 08:00 +0200, Borislav Petkov wrote:
> > On Mon, Jul 17, 2017 at 03:59:12PM -0600, Toshi Kani wrote:
> > > The ghes_edac driver was introduced in 2013 [1], but it has not
> > > been enabled by any distro yet.
On Tue, 2017-07-18 at 08:00 +0200, Borislav Petkov wrote:
> On Mon, Jul 17, 2017 at 03:59:12PM -0600, Toshi Kani wrote:
> > The ghes_edac driver was introduced in 2013 [1], but it has not
> > been enabled by any distro yet. This driver obtains error info
> > from firmware interfaces, which are not
On Tue, 2017-07-18 at 10:24 -0600, Jeffrey Hugo wrote:
> On 7/18/2017 9:36 AM, Kani, Toshimitsu wrote:
> > On Tue, 2017-07-18 at 08:39 -0600, Jeffrey Hugo wrote:
> > > On 7/17/2017 3:59 PM, Toshi Kani wrote:
> > > > The ghes_edac driver was introduced in 2013 [1], but it has not
> > > > been enable
On 7/18/2017 9:36 AM, Kani, Toshimitsu wrote:
> On Tue, 2017-07-18 at 08:39 -0600, Jeffrey Hugo wrote:
> > On 7/17/2017 3:59 PM, Toshi Kani wrote:
> > > The ghes_edac driver was introduced in 2013 [1], but it has not
> > > been enabled by any distro yet.
> > Ubuntu is expected to enable this soon.
> Interesting.
On Tue, 2017-07-18 at 08:39 -0600, Jeffrey Hugo wrote:
> On 7/17/2017 3:59 PM, Toshi Kani wrote:
> > The ghes_edac driver was introduced in 2013 [1], but it has not
> > been enabled by any distro yet.
>
> Ubuntu is expected to enable this soon.
Interesting. I was told from other distro that t
On 7/17/2017 3:59 PM, Toshi Kani wrote:
> The ghes_edac driver was introduced in 2013 [1], but it has not
> been enabled by any distro yet.
Ubuntu is expected to enable this soon.
--
Jeffrey Hugo
Qualcomm Datacenter Technologies as an affiliate of Qualcomm
Technologies, Inc.
Qualcomm Technolo
On Tue, Jul 18, 2017 at 08:00:07AM +0200, Borislav Petkov wrote:
> And I think we should try this first: have the firmware disable
> detection methods so that the platform drivers don't load.
Btw, in looking at this more, what about the firmware-first thing?
I.e., the firmware-first detection wit
On Mon, Jul 17, 2017 at 03:59:12PM -0600, Toshi Kani wrote:
> The ghes_edac driver was introduced in 2013 [1], but it has not
> been enabled by any distro yet. This driver obtains error info
> from firmware interfaces, which are not properly implemented on
> many platforms, as the driver always em