On Friday, 20 April 2018 15:11:43 BST R0b0t1 wrote:
> On Fri, Apr 20, 2018 at 7:21 AM, Mick <michaelkintz...@gmail.com> wrote:
> > On Friday, 20 April 2018 12:55:13 BST Corbin Bird wrote:
> >> Oak Ridge National Laboratory uses these processors ( Rhea Cluster ) and
> >> has numerous heat failures.
> >> 
> >> Due to poor cooling ... surprised?
> >> 
> >> The cooling is not working right. Something is still wrong.
> >> 
> >> On 04/19/2018 09:33 PM, R0b0t1 wrote:
> >> > Dell Precision T7600, two 16 thread Xeons, 192GB of RAM, two Quadro
> >> > cards and a Tesla card.
> >> > 
> >> > The system is a few years old at this point. Old enough that the
> >> > thermal compound could have hardened, which is why I replaced it.
> > 
> > If the problem started suddenly, rather than getting progressively worse
> > over time, it may have something to do with kernel drivers, or some
> > change in firmware.
> 
> As far as I know it has always been like this. It may be why it was
> hardly used before it came into my care. Looking at the server I could
> blame poor design; the inside is rather cramped, despite the care
> taken with the internal baffles. They may not have run a good flow
> simulation.
> 
> Mr. Bird's observation seems to support this.
> 
> > If the cause is mechanical, I'd also suggest checking the heat sink contact
> > surface.  Some heat sinks are poorly manufactured and require flattening
> > with wet 'n dry sandpaper to get a flat enough surface and improve their
> > contact with the CPU.  I've seen 15°C improvement in a Zalman CPU cooler
> > after excess metal was removed from copper pipes, which were manufactured
> > proud.  Hardcore O/C's flatten the CPU too, but I'd avoid anything as
> > radical because it can go badly wrong if you remove more than the surface
> > varnish from the chip.
> > 
> > In the interim, opening the side panel may also help in hot weather.
> 
> The internals are custom made to fit the motherboard, cards, and drive
> slots. It may work better if I move it to another tower but it will be
> a while before I can find one. I will look at the interface between
> the heatsink and processor again, but it looked fine.
> 
> 
> How concerned should I be about overheating machine check errors? I
> used to think that it was best to avoid them, as the threshold was
> high enough that very small parts of the die could overshoot and fail,
> but I was informed that is not the case. Besides the throttling (which
> is fairly bad) I am not sure if there are any drawbacks to the
> overheating.

Semiconductors eventually fail when overheated.  So it is not a good idea to 
continue trying to fry your CPU.

You can confirm the cause of these exceptions by installing and running
'app-admin/mcelog'.  If the tower design is poor and hot air is recirculating
within the case instead of being exhausted, your choices would typically be:

1. Install more effective aftermarket CPU coolers.  This means you have to
spend money, which may be better spent on a new tower/PC.  There may also not
be enough space in the case to fit them, although low profile/compact CPU
coolers exist and you may have better luck with them.

2. Install bigger or additional case fans, to help get the heated air out of
the case and minimise hot spots and hot air recirculation.  You could try
forcing some more air through the case with a small desktop fan to see if this
option has any legs.

3. Modify the case, by drilling/cutting holes to improve air flow, e.g. at the 
top of the case.

4. Migrate the components to a different case/MoBo, which you have already
considered.
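
Alongside mcelog, the kernel's own throttle counters give a quick way to see
whether thermal throttling is actually happening.  The sketch below is only an
illustration: it assumes an Intel CPU and a kernel that exposes
thermal_throttle/core_throttle_count under /sys/devices/system/cpu, and the
function name read_throttle_counts is made up for this example.

```python
from pathlib import Path


def read_throttle_counts(cpu_root="/sys/devices/system/cpu"):
    """Return {cpu_name: core_throttle_count} for each CPU exposing the counter.

    Assumption: the kernel provides cpuN/thermal_throttle/core_throttle_count
    (typical for Intel CPUs). CPUs without the file are skipped.
    """
    counts = {}
    pattern = "cpu[0-9]*/thermal_throttle/core_throttle_count"
    for counter in Path(cpu_root).glob(pattern):
        try:
            # counter.parent.parent is the cpuN directory
            counts[counter.parent.parent.name] = int(counter.read_text().strip())
        except (OSError, ValueError):
            continue  # unreadable or malformed counter: ignore it
    return counts


if __name__ == "__main__":
    counts = read_throttle_counts()
    if not counts:
        print("no thermal_throttle counters found")
    for cpu, n in sorted(counts.items(), key=lambda kv: int(kv[0][3:])):
        print(f"{cpu}: {n} throttle events")
```

Running it before and after a long compile and comparing the numbers should
show whether the counters climb under load.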


> I am wondering what the point of 32 threads is if you can't use them at
> 100%.
> 
> Cheers,
>      R0b0t1

Quite, but the box may not have been designed to cope with the load of
running Gentoo and compiling software on a regular basis.  I've found many
cheaper laptops in particular are so poorly designed from a cooling
perspective that they struggle to run a lengthy Gentoo emerge.  I've also had
desktops which struggled, although nothing as critical as yours.  The
permanent solutions I came up with involved aftermarket cooling fans.  With
boxen I was not keen to spend money on, I would just open the side panel
during an emerge, which allowed the CPU temperature to drop sufficiently to
avoid further thermal throttling.
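
To judge whether an open side panel or an extra fan really helps, it is
useful to log the temperature during an emerge with and without the change.
A minimal sketch, assuming the sensors show up as temp*_input files (in
millidegrees Celsius) under /sys/class/hwmon, as they do with the coretemp
driver; the function names are hypothetical:

```python
import time
from pathlib import Path


def hottest_sensor_c(hwmon_root="/sys/class/hwmon"):
    """Return the highest temp*_input reading in degrees C, or None if none is readable."""
    readings = []
    for sensor in Path(hwmon_root).glob("hwmon*/temp*_input"):
        try:
            # hwmon reports temperatures in millidegrees Celsius
            readings.append(int(sensor.read_text().strip()) / 1000.0)
        except (OSError, ValueError):
            continue  # sensor vanished or returned junk: skip it
    return max(readings) if readings else None


def log_temps(samples=5, interval_s=2.0, hwmon_root="/sys/class/hwmon"):
    """Print one timestamped reading per interval and return the peak seen."""
    peak = None
    for _ in range(samples):
        temp = hottest_sensor_c(hwmon_root)
        if temp is not None:
            peak = temp if peak is None else max(peak, temp)
            print(f"{time.strftime('%H:%M:%S')}  {temp:.1f} C")
        time.sleep(interval_s)
    return peak


if __name__ == "__main__":
    print("peak:", log_temps())
```

Run it once with the panel closed and once with it open, under the same
build, and compare the peaks.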

-- 
Regards,
Mick
