Re: [gentoo-user] machine check exception errors
>>> Thanks Mick. My host is big with multiple data centers of their own. They did exactly as I asked and I'm running on new RAM. There was a problem bringing my system back online and the cause was purported to be an unseated ethernet cable. I handed over my root password as I was requested to do, and then started to get paranoid. I suppose I shouldn't though because with physical access to my machine they pretty much have full access anyway, right?
>>
>> Usually, physical access means they either have it or can get it pretty quick. Boot a CD/DVD, mount the partitions, chroot in, change password and reboot. Then, you don't have the password but they do.
>
> That's pretty obvious though. Physical access allows them to change your password but not read it, so you'd know pretty soon if they'd been up to anything.
>
> If they really do need the root password, you have to give it to them, but that doesn't stop you changing it, and running a rootkit scan, as soon as they've finished with it.

I've run chkrootkit, but I noticed:

The file of stored file properties (rkhunter.dat) does not exist, and so must be created. To do this type in 'rkhunter --propupd'.

I thought the best practice with a rootkit checker like chkrootkit was to not leave it installed on the system so you can run it as a clean install when the time comes?

Do any of these warnings sound an alarm for anyone? I think the SSH warnings are OK because I have a normal user specified with AllowUsers and the config file says:

# The default requires explicit activation of protocol 1
#Protocol 2

Here are the warnings:

Warning: The command '/usr/bin/ldd' has been replaced by a script: /usr/bin/ldd: Bourne-Again shell script text executable
Warning: The command '/usr/bin/whatis' has been replaced by a script: /usr/bin/whatis: POSIX shell script text executable
Warning: The command '/usr/bin/lwp-request' has been replaced by a script: /usr/bin/lwp-request: a /usr/bin/perl -w script text executable
Warning: No output found from the lsmod command or the /proc/modules file:
         /proc/modules output:
         lsmod output:
Warning: The SSH configuration option 'PermitRootLogin' has not been set.
         The default value may be 'yes', to allow root access.
Warning: The SSH configuration option 'Protocol' has not been set.
         The default value may be '2,1', to allow the use of protocol version 1.
Warning: Hidden directory found: /dev/.udev

- Grant
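The two SSH warnings above fire because the options are not set on an uncommented line; a line such as "#Protocol 2" is only a comment, so the compiled-in default applies. A minimal Python sketch of that check follows. It assumes a simple "Keyword value" layout, ignores Match blocks and the "Keyword=value" form, and the user name in the sample is hypothetical. It illustrates the idea only and is not how rkhunter itself is implemented.

```python
def explicit_sshd_options(config_text, names=("PermitRootLogin", "Protocol")):
    """Return {option: value or None} for options set on uncommented lines."""
    wanted = {name.lower(): name for name in names}
    found = {name: None for name in names}
    for raw in config_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # blank or commented out: the built-in default applies
        parts = line.split(None, 1)
        key = parts[0].lower()  # sshd_config keywords are case-insensitive
        if key in wanted and found[wanted[key]] is None:
            # keep the first value seen, as sshd does for repeated keywords
            found[wanted[key]] = parts[1].strip() if len(parts) > 1 else ""
    return found


# Sample text modelled on the excerpt quoted above; "someuser" is hypothetical.
sample = """\
# The default requires explicit activation of protocol 1
#Protocol 2
AllowUsers someuser
"""

for option, value in explicit_sshd_options(sample).items():
    print(option, "->", value if value is not None else "not set (default applies)")
```

Setting both options explicitly in sshd_config (for example "PermitRootLogin no" and "Protocol 2") would remove the ambiguity the warnings point at.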
Re: [gentoo-user] machine check exception errors
On Tuesday 21 September 2010, Stroller wrote:
> On 21 Sep 2010, at 18:37, Grant wrote:
>>> I'm getting a lot of machine check exception errors in dmesg on my hosted server. Running mcelog I get:
>>> ...
>>
>> They offered to take my machine down and do a memory test which they said would take a number of hours. Is a memory test likely to help? Did you suggest reseating or replacing RAM modules as opposed to a memory test because it will result in less downtime?
>
> I suspect that your hosting provider are offering you this memory test because they don't want to go swapping out memory modules willy-nilly. How do they know that the problem is really memory, and not your operating system? If they take all this RAM out and put new RAM in, what do they do with the old RAM? They don't know if it's good or bad, so are they expected to just slap it in a server belonging to another customer, and stitch him up?
>
> A memory test is likely to identify bad RAM, if it is bad, so you should proceed with this. This is likely the best route to solving the problem.

Sure? This is ECC RAM. Does memtest report ECC-corrected errors? I don't think so.

The MCE errors say: we detected an error. The error was corrected. Applications will not see the error. Everything marches on.

The RAM is borked and must be replaced.
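To watch corrected errors on a running system rather than waiting for an offline test, one option is the kernel's EDAC counters. The thread itself only mentions mcelog, so this is an assumption: the sketch below presumes an EDAC driver is loaded and exposes ce_count (corrected) and ue_count (uncorrected) files per memory controller under /sys/devices/system/edac/mc. If that directory does not exist, the function returns an empty result.

```python
from pathlib import Path


def read_edac_counts(edac_root="/sys/devices/system/edac/mc"):
    """Return {controller: (corrected, uncorrected)} from EDAC sysfs counters."""
    counts = {}
    root = Path(edac_root)
    if not root.is_dir():
        return counts  # no EDAC driver loaded, or not a Linux system
    for mc in sorted(root.glob("mc[0-9]*")):
        try:
            ce = int((mc / "ce_count").read_text().strip())
            ue = int((mc / "ue_count").read_text().strip())
        except (OSError, ValueError):
            continue  # skip controllers whose counters cannot be read
        counts[mc.name] = (ce, ue)
    return counts


if __name__ == "__main__":
    for name, (ce, ue) in read_edac_counts().items():
        print(f"{name}: corrected={ce} uncorrected={ue}")
```

A corrected count that keeps climbing while the uncorrected count stays at zero matches the situation described above: applications see nothing, but the module is still failing.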
Re: [gentoo-user] machine check exception errors
On Wed, 22 Sep 2010 23:26:09 -0500, Dale wrote:
>> Thanks Mick. My host is big with multiple data centers of their own. They did exactly as I asked and I'm running on new RAM. There was a problem bringing my system back online and the cause was purported to be an unseated ethernet cable. I handed over my root password as I was requested to do, and then started to get paranoid. I suppose I shouldn't though because with physical access to my machine they pretty much have full access anyway, right?
>
> Usually, physical access means they either have it or can get it pretty quick. Boot a CD/DVD, mount the partitions, chroot in, change password and reboot. Then, you don't have the password but they do.

That's pretty obvious though. Physical access allows them to change your password but not read it, so you'd know pretty soon if they'd been up to anything.

If they really do need the root password, you have to give it to them, but that doesn't stop you changing it, and running a rootkit scan, as soon as they've finished with it.

-- 
Neil Bothwick

God said, div D = rho, div B = 0, curl E = - @B/@t, curl H = J + @D/@t, and there was light.
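The "change but not read" point follows from how passwords are stored: the system keeps a salted one-way hash, not the password. A minimal Python sketch of the idea is below. It uses PBKDF2 from the standard library as a stand-in and is not the crypt(3) format actually used in /etc/shadow; the function names and passwords are hypothetical.

```python
import hashlib
import hmac
import os


def make_record(password, iterations=100_000):
    """Return (salt, iterations, digest); the password itself is not stored."""
    salt = os.urandom(16)
    digest = hashlib.pbkdf2_hmac("sha256", password.encode(), salt, iterations)
    return salt, iterations, digest


def check_password(password, record):
    """Re-derive the digest from a candidate password and compare."""
    salt, iterations, digest = record
    candidate = hashlib.pbkdf2_hmac("sha256", password.encode(), salt, iterations)
    return hmac.compare_digest(candidate, digest)


record = make_record("old-secret")           # what the system keeps on disk
print(check_password("old-secret", record))  # the owner can log in
record = make_record("someone-elses")        # whoever has the disk overwrites it
print(check_password("old-secret", record))  # the owner's password no longer works
```

Overwriting the record is easy with disk access and is noticed at the next login. Recovering the original password from the record requires guessing, so a weak password can still be found offline, which is one more reason to change it afterwards.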
Re: [gentoo-user] machine check exception errors
On Wednesday 22 September 2010 02:24:39 Grant wrote:
>>>>> I'm getting a lot of machine check exception errors in dmesg on my hosted server. Running mcelog I get:
>>>>> ...
>>>>
>>>> They offered to take my machine down and do a memory test which they said would take a number of hours. Is a memory test likely to help? Did you suggest reseating or replacing RAM modules as opposed to a memory test because it will result in less downtime?
>>>
>>> I suspect that your hosting provider are offering you this memory test because they don't want to go swapping out memory modules willy-nilly. How do they know that the problem is really memory, and not your operating system? If they take all this RAM out and put new RAM in, what do they do with the old RAM? They don't know if it's good or bad, so are they expected to just slap it in a server belonging to another customer, and stitch him up?
>>>
>>> A memory test is likely to identify bad RAM, if it is bad, so you should proceed with this. This is likely the best route to solving the problem.
>>>
>>> I think that ideally, for you, they would move the system image onto a different known-good server with the same configuration. Then you cannot complain if the same problems start occurring again. If the problem is genuinely hardware then they won't. And the hosting provider is free to run diagnostics on your old machine.
>>>
>>> But realistically, the memory test is likely to show up a bad RAM module, you'll get it replaced and be up and running within a few hours. Why would you refuse? If your system needed a guaranteed uptime you'd perhaps have to pay for a higher level of service than the fees you're paying at present.
>>
>> I run memory tests overnight. If a module is seriously borked then it will fail earlier. Reseating/replacing takes a few minutes, instead of hours.
>>
>> If they have spare machines (for dev't or testing) they can fit the memory module(s) there and test them exhaustively, before they put the good ones back into a customer's machine.
>
> Thanks Mick and Stroller. I'll see if they'll go for this.

You're welcome.

Bear in mind though that a lot of hosters are just glorified resellers with an account in a bigger data centre. In many cases they do not even have physical access to the machines. Only the data centre techies do and they may be less willing to oblige and break procedure or routine, just because one end user out of hundreds/thousands complained about some memory errors.

YMMV
-- 
Regards,
Mick
Re: [gentoo-user] machine check exception errors
>>>>>> I'm getting a lot of machine check exception errors in dmesg on my hosted server. Running mcelog I get:
>>>>>> ...
>>>>>
>>>>> They offered to take my machine down and do a memory test which they said would take a number of hours. Is a memory test likely to help? Did you suggest reseating or replacing RAM modules as opposed to a memory test because it will result in less downtime?
>>>>
>>>> I suspect that your hosting provider are offering you this memory test because they don't want to go swapping out memory modules willy-nilly. How do they know that the problem is really memory, and not your operating system? If they take all this RAM out and put new RAM in, what do they do with the old RAM? They don't know if it's good or bad, so are they expected to just slap it in a server belonging to another customer, and stitch him up?
>>>>
>>>> A memory test is likely to identify bad RAM, if it is bad, so you should proceed with this. This is likely the best route to solving the problem.
>>>>
>>>> I think that ideally, for you, they would move the system image onto a different known-good server with the same configuration. Then you cannot complain if the same problems start occurring again. If the problem is genuinely hardware then they won't. And the hosting provider is free to run diagnostics on your old machine.
>>>>
>>>> But realistically, the memory test is likely to show up a bad RAM module, you'll get it replaced and be up and running within a few hours. Why would you refuse? If your system needed a guaranteed uptime you'd perhaps have to pay for a higher level of service than the fees you're paying at present.
>>>
>>> I run memory tests overnight. If a module is seriously borked then it will fail earlier. Reseating/replacing takes a few minutes, instead of hours.
>>>
>>> If they have spare machines (for dev't or testing) they can fit the memory module(s) there and test them exhaustively, before they put the good ones back into a customer's machine.
>>
>> Thanks Mick and Stroller. I'll see if they'll go for this.
>
> You're welcome.
>
> Bear in mind though that a lot of hosters are just glorified resellers with an account in a bigger data centre. In many cases they do not even have physical access to the machines. Only the data centre techies do and they may be less willing to oblige and break procedure or routine, just because one end user out of hundreds/thousands complained about some memory errors.

Thanks Mick. My host is big with multiple data centers of their own. They did exactly as I asked and I'm running on new RAM. There was a problem bringing my system back online and the cause was purported to be an unseated ethernet cable. I handed over my root password as I was requested to do, and then started to get paranoid. I suppose I shouldn't though because with physical access to my machine they pretty much have full access anyway, right?

- Grant
Re: [gentoo-user] machine check exception errors
Grant wrote:
> Thanks Mick. My host is big with multiple data centers of their own. They did exactly as I asked and I'm running on new RAM. There was a problem bringing my system back online and the cause was purported to be an unseated ethernet cable. I handed over my root password as I was requested to do, and then started to get paranoid. I suppose I shouldn't though because with physical access to my machine they pretty much have full access anyway, right?
>
> - Grant

Usually, physical access means they either have it or can get it pretty quick. Boot a CD/DVD, mount the partitions, chroot in, change password and reboot. Then, you don't have the password but they do.

My conspiracy hat on, if you can't trust them with the password, why do they have your data? Just thinking. ;-)

This leaves out the encryption thing tho. That would change things.

Dale

:-) :-)
Re: [gentoo-user] machine check exception errors
>>> I'm getting a lot of machine check exception errors in dmesg on my hosted server. Running mcelog I get:
>>>
>>> # mcelog
>>> HARDWARE ERROR. This is *NOT* a software problem!
>>> [...]
>>>
>>> Should I just contact the hosting company? Can anyone give me more info on what this means? Bad memory?
>>
>> They are likely better able to help you if it's a hardware problem.
>
> It reads as if the error correction in one of the RAM modules is kicking in. Ask them to reseat or replace the bad module - which they will have to find by trial and error. They could hot-swap them and see when the errors stop.
> --
> Regards,
> Mick

They offered to take my machine down and do a memory test which they said would take a number of hours. Is a memory test likely to help? Did you suggest reseating or replacing RAM modules as opposed to a memory test because it will result in less downtime?

- Grant
Re: [gentoo-user] machine check exception errors
On 21 Sep 2010, at 18:37, Grant wrote:
>> I'm getting a lot of machine check exception errors in dmesg on my hosted server. Running mcelog I get:
>> ...
>
> They offered to take my machine down and do a memory test which they said would take a number of hours. Is a memory test likely to help? Did you suggest reseating or replacing RAM modules as opposed to a memory test because it will result in less downtime?

I suspect that your hosting provider are offering you this memory test because they don't want to go swapping out memory modules willy-nilly. How do they know that the problem is really memory, and not your operating system? If they take all this RAM out and put new RAM in, what do they do with the old RAM? They don't know if it's good or bad, so are they expected to just slap it in a server belonging to another customer, and stitch him up?

A memory test is likely to identify bad RAM, if it is bad, so you should proceed with this. This is likely the best route to solving the problem.

I think that ideally, for you, they would move the system image onto a different known-good server with the same configuration. Then you cannot complain if the same problems start occurring again. If the problem is genuinely hardware then they won't. And the hosting provider is free to run diagnostics on your old machine.

But realistically, the memory test is likely to show up a bad RAM module, you'll get it replaced and be up and running within a few hours. Why would you refuse? If your system needed a guaranteed uptime you'd perhaps have to pay for a higher level of service than the fees you're paying at present.

Stroller.
Re: [gentoo-user] machine check exception errors
On Tuesday 21 September 2010 20:15:05 Stroller wrote:
> On 21 Sep 2010, at 18:37, Grant wrote:
>>> I'm getting a lot of machine check exception errors in dmesg on my hosted server. Running mcelog I get:
>>> ...
>>
>> They offered to take my machine down and do a memory test which they said would take a number of hours. Is a memory test likely to help? Did you suggest reseating or replacing RAM modules as opposed to a memory test because it will result in less downtime?
>
> I suspect that your hosting provider are offering you this memory test because they don't want to go swapping out memory modules willy-nilly. How do they know that the problem is really memory, and not your operating system? If they take all this RAM out and put new RAM in, what do they do with the old RAM? They don't know if it's good or bad, so are they expected to just slap it in a server belonging to another customer, and stitch him up?
>
> A memory test is likely to identify bad RAM, if it is bad, so you should proceed with this. This is likely the best route to solving the problem.
>
> I think that ideally, for you, they would move the system image onto a different known-good server with the same configuration. Then you cannot complain if the same problems start occurring again. If the problem is genuinely hardware then they won't. And the hosting provider is free to run diagnostics on your old machine.
>
> But realistically, the memory test is likely to show up a bad RAM module, you'll get it replaced and be up and running within a few hours. Why would you refuse? If your system needed a guaranteed uptime you'd perhaps have to pay for a higher level of service than the fees you're paying at present.

I run memory tests overnight. If a module is seriously borked then it will fail earlier. Reseating/replacing takes a few minutes, instead of hours.

If they have spare machines (for dev't or testing) they can fit the memory module(s) there and test them exhaustively, before they put the good ones back into a customer's machine.
-- 
Regards,
Mick
Re: [gentoo-user] machine check exception errors
>>>> I'm getting a lot of machine check exception errors in dmesg on my hosted server. Running mcelog I get:
>>>> ...
>>>
>>> They offered to take my machine down and do a memory test which they said would take a number of hours. Is a memory test likely to help? Did you suggest reseating or replacing RAM modules as opposed to a memory test because it will result in less downtime?
>>
>> I suspect that your hosting provider are offering you this memory test because they don't want to go swapping out memory modules willy-nilly. How do they know that the problem is really memory, and not your operating system? If they take all this RAM out and put new RAM in, what do they do with the old RAM? They don't know if it's good or bad, so are they expected to just slap it in a server belonging to another customer, and stitch him up?
>>
>> A memory test is likely to identify bad RAM, if it is bad, so you should proceed with this. This is likely the best route to solving the problem.
>>
>> I think that ideally, for you, they would move the system image onto a different known-good server with the same configuration. Then you cannot complain if the same problems start occurring again. If the problem is genuinely hardware then they won't. And the hosting provider is free to run diagnostics on your old machine.
>>
>> But realistically, the memory test is likely to show up a bad RAM module, you'll get it replaced and be up and running within a few hours. Why would you refuse? If your system needed a guaranteed uptime you'd perhaps have to pay for a higher level of service than the fees you're paying at present.
>
> I run memory tests overnight. If a module is seriously borked then it will fail earlier. Reseating/replacing takes a few minutes, instead of hours.
>
> If they have spare machines (for dev't or testing) they can fit the memory module(s) there and test them exhaustively, before they put the good ones back into a customer's machine.

Thanks Mick and Stroller. I'll see if they'll go for this.

- Grant
Re: [gentoo-user] machine check exception errors
On Tuesday 14 September 2010 19:16:52 Albert Hopkins wrote:
> On Tue, 2010-09-14 at 09:45 -0700, Grant wrote:
>> I'm getting a lot of machine check exception errors in dmesg on my hosted server. Running mcelog I get:
>>
>> # mcelog
>> HARDWARE ERROR. This is *NOT* a software problem!
>> [...]
>>
>> Should I just contact the hosting company? Can anyone give me more info on what this means? Bad memory?
>
> They are likely better able to help you if it's a hardware problem.

It reads as if the error correction in one of the RAM modules is kicking in. Ask them to reseat or replace the bad module - which they will have to find by trial and error. They could hot-swap them and see when the errors stop.
-- 
Regards,
Mick
Re: [gentoo-user] machine check exception errors
On Tue, 2010-09-14 at 09:45 -0700, Grant wrote:
> I'm getting a lot of machine check exception errors in dmesg on my hosted server. Running mcelog I get:
>
> # mcelog
> HARDWARE ERROR. This is *NOT* a software problem!
> [...]
>
> Should I just contact the hosting company? Can anyone give me more info on what this means? Bad memory?

They are likely better able to help you if it's a hardware problem.