Re: Disk reliability - and swap out

Yasha Karant Tue, 10 Aug 2021 17:20:34 -0700

Not all locations are as strict as BC, and below 50V (DC as well?) can cause fires, etc. -- those pesky Li battery issues that you may recall. My reference to power supplies was to all power supplies that have a risk of "high" temperature, fire, or explosion.


On 8/10/21 4:43 PM, Konstantin Olchanski wrote:



In the Province of British Columbia, AC-side electrical equipment
is designed by registered Professional Engineers (PE) and worked
on by licensed electricians. At TRIUMF these people are better
than average.

Equipment we have built for SNOLAB and CERN was certified
by an outside licensed inspector and has the CSA and CE marks. Inspectors
were pretty strict and required a few changes in order to pass muster.

Equipment below 50V generally does not need to be inspected/certified,
but I, as an end-user, always ask "where is the fuse?" and "is this
a fuse of the right type?" (I know there are different types of fuses
but it is not my job to select the correct one).

We take electrical safety very seriously.

P.S. You cannot use a quarter to jumper blown fuses on current
generation electronics. You are lucky if you can even see
the fuse without a microscope. Replacing blown fuses is generally
done by electronics technicians and they will not honor the request
to "just jumper it!".

P.P.S. A physics lab is not Boeing with their "only flies down" airplanes
and "what time is now?" space ships.


K.O.


On Tue, Aug 10, 2021 at 04:10:26PM -0700, Yasha Karant wrote:

A proper circuit breaker, hopefully with external or simple panel
removal access (not remove from rack, open chassis, remove ... ),
will work fine and typically is better than a fuse.  A "soldered in
place" fusible link also will work, but is much more difficult to
service and replace.  Anyone who puts a jumper over an overcurrent
device (the "coin in the fuse box"), other than for diagnostic
testing, needs to be both educated and reprimanded.  Note that if
there is a power supply (typically from the mains), one needs a
circuit breaker both for the power supply and for the items that the
power supply is supplying. Clearly, the safety engineering unit is
not verifying that any custom apparatus meets basic fail-safe
practices -- I am not suggesting actual UL, etc., testing and
certification for a specific experimental data or control device
(although I do look for such certifications on the actual circuit
breaker -- accidents are not nice).

On 8/10/21 4:00 PM, Konstantin Olchanski wrote:

On Tue, Aug 10, 2021 at 03:34:00PM -0700, Yasha Karant wrote:

One SSD had an internal short and turned into a space heater,
luckily there was no fire. End excerpt.

Clearly, there is very poor safety engineering and/or quality
control


you will not be amused to learn how many electronics lack
proper fuses and protections against internal and external
shorts. even here, I have seen good people forget to put fuses
on newly built boards.

(as with certain Li batteries that did similar things in
personal devices being operated by the user).


that's different. SSD stores bits, Li battery stores Joules,
and "bits do not burn".

If that SSD had been inside a laptop (presumably it was inside a rack-mounted
disk farm, with fire extinguishers and possibly a machine room fire
suppression system), things could have had a much worse outcome
(most laptops have combustible materials).


tangled server rooms, laptops, men, guns, horses all together.

laptop battery probably will not have enough oomph for a good SSD fire,
cannot supply enough Amps, will shutdown before things get hot. ditto
for laptop power supply (60 W vs 600 W PC power supply).

server chassis with rack mounted SSD in a server room has such good cooling
that the shorted SSD will only get slightly warm. also server power supply
will probably shutdown quickly because of undervoltage condition. so no fire.

in this particular case, the computer was in an experimental area,
that has combustible materials, etc.


As for the small amount of storage, the commentator is at a
reasonably well funded (through government sources and possible
tax-deductible or glamour philanthropy) HEP facility.


We also have a $$$ printing press in our basement (I have a key!) and
we can transmute lead into gold (only slightly radioactive).

K.O.


Much of the world is rather poorly funded, including non-collaboration-funded
university research facilities at most entities within the USA
(not all faculty members can be at Harvard, Stanford, etc.) --
administrative and some instructional facilities typically can get
much more.  Many universities now outsource to paid "cloud" storage,
with all of the issues that may entail.

On 8/10/21 3:08 PM, Konstantin Olchanski wrote:

Hi, Larry, thank you for this information, it is always good to see
how other people do things.

I am surprised at how little storage you have, only a handful of TBs.

Here, for each experiment data acquisition station, we now configure
2x1TB SSD for os, home dirs, apps, etc and 2x8-10-12TB HDD for recording
experiment data. We use "sort by price" NAS CMR HDDs (WD red, etc).

All disks are doubled up as linux mdadm raid1 (mirror) or ZFS mirror. This is
to prevent any disruption of data taking from single-disk failure.

(it is important to configure the boot loader on both SSDs to boot
even if the other SSD is dead).
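
A mirror only protects data taking if someone notices when one half has
died. As a minimal sketch (not part of the original message), the Python
below scans text in the format of /proc/mdstat and reports md arrays that
are missing a member. The function name degraded_arrays and the sample
text are illustrative assumptions, and a ZFS mirror would need a
different check.

```python
import re

def degraded_arrays(mdstat_text):
    """Return names of md arrays whose status string shows a missing member."""
    degraded = []
    current = None
    for line in mdstat_text.splitlines():
        # Array header lines look like "md0 : active raid1 sdb1[1] sda1[0]".
        header = re.match(r"^(md\w+)\s*:", line)
        if header:
            current = header.group(1)
            continue
        # Status lines end with e.g. "[2/2] [UU]"; "_" marks a missing member.
        status = re.search(r"\[(\d+)/(\d+)\]\s+\[([U_]+)\]", line)
        if status and current:
            if "_" in status.group(3):
                degraded.append(current)
            current = None
    return degraded

# Hypothetical sample: md0 is healthy, md1 has lost one disk.
SAMPLE = """\
Personalities : [raid1]
md0 : active raid1 sdb1[1] sda1[0]
      976630464 blocks super 1.2 [2/2] [UU]

md1 : active raid1 sdb2[1](F) sda2[0]
      7813894144 blocks super 1.2 [2/1] [U_]

unused devices: <none>
"""

print(degraded_arrays(SAMPLE))
```

On a real machine the input would come from reading /proc/mdstat, for
example from a cron job that sends mail when the returned list is not
empty.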

I am surprised you use 1TB HDDs. We switched to SSDs for sizes up to 2TB
(WD blue SATA SSDs).

Failure rates of HDDs, the only reliable data is from backblaze:
https://www.backblaze.com/b2/hard-drive-test-data.html
and
https://www.backblaze.com/blog/backblaze-drive-stats-for-q2-2021/

Failure rates of SSDs seem to be very low, I only have 2-3 failed SSDs.
One SSD had an internal short and turned into a space heater, luckily
there was no fire.

For backups of os and home dirs we use amanda and rsync+zfs snapshots. Backups
of experiment data are not our responsibility (many experiments use usb hdds).


K.O.


On Tue, Aug 10, 2021 at 10:55:35AM -0400, Larry Linder wrote:

There are 25 systems in our shop, all linux based, a linux based server,
and a Synology Disk Station running raid 1.   The Disk Station has 12 TB
of space, 6 TB for each raid level.

We buy only one brand of disk with the black label.  They are typically
1 TB.

User boxes have an SSD drive for the OS, a 2 TB disk for the user's
space, 32 GB of RAM, and a quad or six core AMD processor.  The graphics
boxes get a video card with lots of RAM.  3D rendering on a slow video
card wastes a lot of the user's time.

The server has an SSD for the OS and 6 TB for user apps /
library /usr/local and /opt.  It also has a mirror disk that keeps a
copy of the server locally.

These systems are on 24 / 7 and accumulate a lot of hours.  No matter
what the make, mechanical disks have a life span.  For grins I used to do
a post mortem on disks that failed.   There were two types of failure.
In the first, the spring that returns the arm holding the heads cracks.
In the second, the main bearings fail.  Newer disks seem to have a lower
bearing failure rate.
To prevent operational problems we just swap out the disk on each box at
about 5,000 to 7,000 hr.  The manufacturer says they are good for 10,000
hr; see the fine print in the warranty.  You have to remember this is a
money making operation and down time is costly.
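
To put those hour counts in calendar terms, here is a minimal sketch
(not part of the original message) that converts a power-on-hour limit
into months of 24 / 7 operation and flags a disk that has reached the
swap threshold. The function names and the default threshold of 5,000 hr
are illustrative assumptions. On most drives the current hour count can
be read from the SMART Power_On_Hours attribute, for example with
smartctl.

```python
DAYS_PER_MONTH = 365 / 12  # average month length, ignoring leap years

def months_until(hours_limit, hours_per_day=24.0):
    """Calendar months needed to accumulate hours_limit power-on hours."""
    days = hours_limit / hours_per_day
    return days / DAYS_PER_MONTH

def due_for_swap(power_on_hours, swap_at=5000):
    """True once a disk has reached the chosen swap threshold (in hours)."""
    return power_on_hours >= swap_at

# The thresholds quoted in the message, for a machine that is never off.
for limit in (5000, 7000, 10000):
    print(f"{limit} hr is about {months_until(limit):.1f} months at 24/7")
```

Under these assumptions a box that is never powered off reaches
5,000 hr in a little under 7 months and 10,000 hr in under 14 months,
so the swap policy described above means replacing each disk more often
than once a year.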

Backups run at 12:29 and 0:29 in the AM.  At the end of the morning back
up a copy is sent to a remote site.

For security we shut down the network at 6:20 PM, bring it up at 0:01 AM
and shut it down after back up is complete.  We bring it back up at 6:45
AM.
10 years ago we had a fixed IP and the Chinese found it by just
continually pounding on the door.  The return IP was 4 hops to a city
north east of Shanghai.  They had installed a root kit on our server and
disabled cron.  When you changed the password on the server, a few
milliseconds later it was sent to China.  We got rid of the fixed IP and
reloaded all the systems.  So when you shut down the network to your
provider, the next time you start it you get a different IP.

We don't give the disks away as they contain a lot of design data,
SW, CAD programs, part programs for our mill etc.  We donate them to a
charity that drills the disks and recycles the rest.

Larry Linder
