Re: [CentOS] HDD badblocks

2016-01-23 Thread Chris Murphy
On Thu, Jan 21, 2016 at 9:27 AM, Lamar Owen  wrote:
> On 01/20/2016 01:43 PM, Chris Murphy wrote:
>>
>> On Wed, Jan 20, 2016, 7:17 AM Lamar Owen  wrote:
>>
>>> The standard Unix way of refreshing the disk contents is with badblocks'
>>> non-destructive read-write test (badblocks -n or as the -cc option to
>>> e2fsck, for ext2/3/4 filesystems).
>>
>>
>> This isn't applicable to RAID, which is what this thread is about. For
>> RAID, use scrub; that's what it's for.
>
>
> The badblocks read/write verification would need to be done on the RAID
> member devices, not the aggregate md device, for member device level remap.
> It might need to be done with the md offline, not sure.  Scrub?  There is a
> scrub command (and package) in CentOS, but it's meant for secure data
> erasure, and is not a non-destructive thing.  Ah, you're talking about what
> md will do if 'check' or 'repair' is written to the appropriate location in
> the sysfs for the md in question.  (This info is in the md(4) man page).


Correct.




-- 
Chris Murphy
___
CentOS mailing list
CentOS@centos.org
https://lists.centos.org/mailman/listinfo/centos


Re: [CentOS] HDD badblocks

2016-01-21 Thread Lamar Owen

On 01/20/2016 01:43 PM, Chris Murphy wrote:

On Wed, Jan 20, 2016, 7:17 AM Lamar Owen  wrote:

The standard Unix way of refreshing the disk contents is with 
badblocks' non-destructive read-write test (badblocks -n or as the 
-cc option to e2fsck, for ext2/3/4 filesystems). 


This isn't applicable to RAID, which is what this thread is about. For
RAID, use scrub; that's what it's for.


The badblocks read/write verification would need to be done on the RAID 
member devices, not the aggregate md device, for member device level 
remap.  It might need to be done with the md offline, not sure.  Scrub?  
There is a scrub command (and package) in CentOS, but it's meant for 
secure data erasure, and is not a non-destructive thing.  Ah, you're 
talking about what md will do if 'check' or 'repair' is written to the 
appropriate location in the sysfs for the md in question.  (This info is 
in the md(4) man page).



The badblocks method fixes nothing if the sector is persistently bad and
the drive reports a read error.


The badblocks method will do a one-off read/write verification on a 
member device; no, it won't do it automatically, true enough.



It fixes nothing if the command timeout is
reached before the drive either recovers or reports a read error.


Very true.


And even
if it works, you're relying on ECC recovered data rather than reading a
likely good copy from mirror or parity and writing that back to the bad
block.


Yes, for the member drive this is true.  Since my storage here is 
primarily on EMC Clariion, I'm not sure what the equivalent of EMC's 
background verify would be for mdraid, since I've not needed that 
functionality from mdraid.  (I really don't like the term 'software 
RAID', since at some level all RAID is software RAID, whether it runs on 
a storage processor or in the RAID controller's firmware.)  It does 
appear that triggering a scrub from sysfs for a particular md provides 
similar functionality, and would do the remap if inconsistent data is 
found.  This is a bit different from the old Unix way, but these are 
newer times and so the way of doing things is different.
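The sysfs-triggered scrub mentioned above can be sketched as follows. This is a sketch under assumptions: "md0" is a placeholder array name, and DRY_RUN=1 (the default here) only prints the commands, since the real thing needs root and an actual md array.

```shell
# Placeholder array name; substitute your own md device.
MD="${MD:-md0}"
DRY_RUN="${DRY_RUN:-1}"

# 'check' scans every stripe and counts inconsistencies without writing;
# 'repair' additionally rewrites mirror/parity data (see md(4)).
SCRUB_CMD="echo check > /sys/block/$MD/md/sync_action"
# After the scrub finishes, mismatch_cnt reports what it found.
COUNT_CMD="cat /sys/block/$MD/md/mismatch_cnt"

if [ "$DRY_RUN" = "1" ]; then
    printf 'would run: %s\n' "$SCRUB_CMD" "$COUNT_CMD"
else
    sh -c "$SCRUB_CMD" && sh -c "$COUNT_CMD"
fi
```

Progress of a running scrub can be watched in /proc/mdstat.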



But all of this still requires the proper configuration.

Yes, this is very true.

___
CentOS mailing list
CentOS@centos.org
https://lists.centos.org/mailman/listinfo/centos


Re: [CentOS] HDD badblocks

2016-01-20 Thread Chris Murphy
On Wed, Jan 20, 2016, 7:17 AM Lamar Owen  wrote:

> On 01/19/2016 06:46 PM, Chris Murphy wrote:
> > Hence, bad sectors accumulate. And the consequence of this often
> > doesn't get figured out until a user looks at kernel messages and sees
> > a bunch of hard link resets
>
> The standard Unix way of refreshing the disk contents is with badblocks'
> non-destructive read-write test (badblocks -n or as the -cc option to
> e2fsck, for ext2/3/4 filesystems).


This isn't applicable to RAID, which is what this thread is about. For
RAID, use scrub; that's what it's for.

The badblocks method fixes nothing if the sector is persistently bad and
the drive reports a read error. It fixes nothing if the command timeout is
reached before the drive either recovers or reports a read error. And even
if it works, you're relying on ECC recovered data rather than reading a
likely good copy from mirror or parity and writing that back to the bad
block.

But all of this still requires the proper configuration.


> The remap will happen on the
> writeback of the contents.  It's been this way with enterprise SCSI
> drives for as long as I can remember there being enterprise-class SCSI
> drives.  ATA drives caught up with the SCSI ones back in the early 90's
> with this feature.  But it's always been true, to the best of my
> recollection, that the remap always happens on a write.


Properly configured, first a read error happens, which includes the LBA of
the bad sector. The md driver needs that LBA to find a good copy of the
data from mirror or parity. *Then* it writes that copy back to the bad LBA.

In the case of misconfiguration, the command timeout expires and the link
reset prevents the kernel from learning the LBA of the bad sector, and
therefore repair isn't possible.


> The rationale
> is pretty simple: only on a write error does the drive know that it has
> the valid data in its buffer, and so that's the only safe time to put
> the data elsewhere.
>
> > This problem affects all software raid, including btrfs raid1. The
> > ideal scenario is you'll use 'smartctl -l scterc,70,70 /dev/sdX' in
> > startup script, so the drive fails reads on marginally bad sectors
> > with an error in 7 seconds maximum.
> >
> This is partly why enterprise arrays manage their own per-sector ECC and
> use 528-byte sector sizes.



Not all enterprise drives have 520/528 byte sectors. Those that do are
using T10-PI (formerly DIF) and it requires software support too. It's
pretty rare. It's 8000% easier to use ZFS on Linux or Btrfs.




> But the other fact of life of modern consumer-level hard drives is that
> *errored sectors are expected* and not exceptions.  Why else would a
> drive have a TLER in the two minute range like many of the WD Green
> drives do?  And with a consumer-level drive I would be shocked if
> badblocks reported the same number each time it ran through.
>

All drives expect bad sectors. Consumer drives avoid reporting read errors,
because a read error can put the host OS into an inconsistent state;
becoming slow is considered better than implosion. And neither OS X nor
Windows resets the link after merely 30 seconds either.


Chris Murphy


Re: [CentOS] HDD badblocks

2016-01-20 Thread James B. Byrne

On Tue, January 19, 2016 18:36, John R Pierce wrote:
> On 1/19/2016 3:29 PM, J Martin Rushton wrote:
>> I suspect that the gold layer on edge connectors 30-odd years ago
>> was
>> a lot thicker than on modern cards.  We are talking contacts on 0.1"
>> spacing not some modern 1/10 of a knat's whisker.  (Off topic) I
>> also
>> remember seeing engineers determine which memory chip was at fault
>> and
>> replacing the chip using a soldering iron.  Try that on a DIMM!
>
> indeed, I pretty much quit doing component level electronics when
> everything went to surface mount.
>
>

Kids these days!  I remember taking the vacuum tubes to the testing
centre in the corner drug-store to see which ones needed replacing.

Apologies to the four Yorkshiremen.


-- 
***  e-Mail is NOT a SECURE channel  ***
Do NOT transmit sensitive data via e-Mail
James B. Byrne   mailto:byrn...@harte-lyne.ca
Harte & Lyne Limited  http://www.harte-lyne.ca
9 Brockley Drive  vox: +1 905 561 1241
Hamilton, Ontario fax: +1 905 561 0757
Canada  L8E 3C3



Re: [CentOS] HDD badblocks

2016-01-20 Thread Lamar Owen

On 01/19/2016 06:29 PM, J Martin Rushton wrote:

(Off topic) I also
remember seeing engineers determine which memory chip was at fault and
replacing the chip using a soldering iron.  Try that on a DIMM!


As long as the DIMM isn't populated with BGA packages it's about a 
ten-minute job with a hot air rework station, which will only cost you 
around $100 or so if you shop around (and if you have a relatively 
steady hand and either good eyes or a good magnifier). It's doable in a 
DIY way even with BGA, but takes longer and you need a reballing mask 
for that specific package to make it work right.  Any accurately 
controlled oven is good enough to do the reflow (and baking Xbox boards 
is essentially doing a reflow).


Yeah, I prefer tubes and discretes and through-hole PCB's myself, but at 
this point I've acquired a hot air station and am getting up to speed on 
surface mount, and am finding that it's not really that hard, just 
different.


This is not that different from getting up to speed with something 
really new and different, like systemd.  It just requires being willing 
to take a different approach to the problem.  BGA 
desoldering/resoldering requires a whole different way of looking at the 
soldering operation, that's all.




Re: [CentOS] HDD badblocks

2016-01-20 Thread Lamar Owen

On 01/19/2016 06:46 PM, Chris Murphy wrote:

Hence, bad sectors accumulate. And the consequence of this often
doesn't get figured out until a user looks at kernel messages and sees
a bunch of hard link resets


The standard Unix way of refreshing the disk contents is with badblocks' 
non-destructive read-write test (badblocks -n or as the -cc option to 
e2fsck, for ext2/3/4 filesystems).  The remap will happen on the 
writeback of the contents.  It's been this way with enterprise SCSI 
drives for as long as I can remember there being enterprise-class SCSI 
drives.  ATA drives caught up with the SCSI ones back in the early 90's 
with this feature.  But it's always been true, to the best of my 
recollection, that the remap always happens on a write.  The rationale 
is pretty simple: only on a write error does the drive know that it has 
the valid data in its buffer, and so that's the only safe time to put 
the data elsewhere.
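The non-destructive pass described above can be sketched like this, run against a RAID member device rather than the md aggregate. Assumptions: /dev/sdb is a placeholder device, and DRY_RUN=1 (the default) only prints the command, since badblocks needs root and a real device (and the md should arguably be offline first).

```shell
DEV="${DEV:-/dev/sdb}"
DRY_RUN="${DRY_RUN:-1}"

# -n: non-destructive read-write test (reads each block, test-writes
#     patterns, then restores the original contents)
# -s: show progress; -v: report what it finds
BB_CMD="badblocks -nsv $DEV"

if [ "$DRY_RUN" = "1" ]; then
    printf 'would run: %s\n' "$BB_CMD"
else
    $BB_CMD
fi
```

The writeback of each block's contents is what gives the drive its chance to remap.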



This problem affects all software raid, including btrfs raid1. The
ideal scenario is you'll use 'smartctl -l scterc,70,70 /dev/sdX' in
startup script, so the drive fails reads on marginally bad sectors
with an error in 7 seconds maximum.

This is partly why enterprise arrays manage their own per-sector ECC and 
use 528-byte sector sizes.  The drives for these arrays make very poor 
workstation standalone drives, since the drive is no longer doing all 
the error recovery itself, but relying on the storage processor to do 
the work.  Now, the drive is still doing some basic ECC on the sector, 
but the storage processor is getting a much better idea of the health of 
each sector than when the drive's firmware is managing remap.  
Sophisticated enterprise arrays, like NetApp's, EMC's, and Nimble's, can 
do some very accurate predictions and proactive hotsparing when needed.  
That's part of what you pay for when you buy that sort of array.


But the other fact of life of modern consumer-level hard drives is that 
*errored sectors are expected* and not exceptions.  Why else would a 
drive have a TLER in the two minute range like many of the WD Green 
drives do?  And with a consumer-level drive I would be shocked if 
badblocks reported the same number each time it ran through.




Re: [CentOS] HDD badblocks

2016-01-19 Thread John R Pierce

On 1/19/2016 2:24 PM, Warren Young wrote:

It’s dying.  Replace it now.


agreed


On a modern hard disk, you should *never* see bad sectors, because the drive is 
busy hiding all the bad sectors it does find, then telling you everything is 
fine.


that's not actually true.  The drive will report 'bad sector' if you 
try to read data that the drive simply can't read; you wouldn't want 
it to return bad data and say it's OK.  Many (most?) drives won't 
actually remap a bad sector until you write new data over that block 
number, since they don't want to copy bad data without any way of 
telling the OS the data is invalid.  These pending remaps are listed 
under SMART attribute 197, Current_Pending_Sector.
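Reading attribute 197 out of smartctl's attribute table can be sketched as below. The SAMPLE line is illustrative, in the usual 'smartctl -A' layout, not output from a real drive; in practice you would pipe 'smartctl -A /dev/sdX' in instead.

```shell
# Illustrative attribute row, not real drive data.
SAMPLE='197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       32'
# The raw value is the last field of the attribute row.
PENDING=$(printf '%s\n' "$SAMPLE" | awk '$2 == "Current_Pending_Sector" {print $NF}')
echo "pending sectors: $PENDING"
```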




--
john r pierce, recycling bits in santa cruz



Re: [CentOS] HDD badblocks

2016-01-19 Thread m . roth
Chris Murphy wrote:
> On Mon, Jan 18, 2016, 4:39 AM Alessandro Baggi
> 
> wrote:
>> Il 18/01/2016 12:09, Chris Murphy ha scritto:
>> > What is the result for each drive?
>> >
>> > smartctl -l scterc 
>> >
>> SCT Error Recovery Control command not supported
>>
> The drive is disqualified unless your usecase can tolerate the possibly
> very high error recovery time for these drives.
>
> Do a search for Red Hat documentation on the SCSI Command Timer. By
> default this is 30 seconds. You'll have to raise it to 120 or maybe even
> 180 depending on the maximum time the drive attempts to recover. The SCSI
> Command Timer is a kernel setting per block device. Basically the kernel
> gives up and resets the link to the drive, because while the drive is in
> deep recovery it doesn't respond to anything.
>
Replace the drive. Yesterday.

 mark



Re: [CentOS] HDD badblocks

2016-01-19 Thread Chris Murphy
On Mon, Jan 18, 2016, 4:39 AM Alessandro Baggi 
wrote:

> Il 18/01/2016 12:09, Chris Murphy ha scritto:
> > What is the result for each drive?
> >
> > smartctl -l scterc 
> >
> >
> > Chris Murphy
> SCT Error Recovery Control command not supported
>



The drive is disqualified unless your use case can tolerate the possibly
very high error recovery time of these drives.

Do a search for Red Hat documentation on the SCSI Command Timer. By default
this is 30 seconds. You'll have to raise it to 120 or maybe even 180
depending on the maximum time the drive attempts to recover. The SCSI
Command Timer is a kernel setting per block device. Basically the kernel
gives up and resets the link to the drive, because while the drive is in
deep recovery it doesn't respond to anything.




Chris Murphy






Re: [CentOS] HDD badblocks

2016-01-19 Thread Chris Murphy
On Tue, Jan 19, 2016, 3:30 PM   wrote:

> Chris Murphy wrote:
> > On Mon, Jan 18, 2016, 4:39 AM Alessandro Baggi
> > 
> > wrote:
> >> Il 18/01/2016 12:09, Chris Murphy ha scritto:
> >> > What is the result for each drive?
> >> >
> >> > smartctl -l scterc 
> >> >
> >> SCT Error Recovery Control command not supported
> >>
> > The drive is disqualified unless your usecase can tolerate the possibly
> > very high error recovery time for these drives.
> >
> > Do a search for Red Hat documentation on the SCSI Command Timer. By
> > default this is 30 seconds. You'll have to raise it to 120 or maybe
> > even 180 depending on the maximum time the drive attempts to recover.
> > The SCSI Command Timer is a kernel setting per block device. Basically
> > the kernel gives up and resets the link to the drive, because while the
> > drive is in deep recovery it doesn't respond to anything.
> >
> Replace the drive. Yesterday.


That's just masking the problem; his setup will still be misconfigured for
RAID.

Is it a 512e AF drive? If so, the bad sector count is inflated by a factor
of 8. In reality, fewer than 15 physical sectors are bad. And none have
been reallocated, due to the misconfiguration.


Chris Murphy


Re: [CentOS] HDD badblocks

2016-01-19 Thread Warren Young
On Jan 17, 2016, at 9:59 AM, Alessandro Baggi  
wrote:
> 
> On sdb there are no problems, but with sda:
> 
> 1) First run badblocks reports 28 badblocks on disk
> 2) Second run badblocks reports 32 badblocks
> 3) Third reports 102 badblocks
> 4) Last run reports 92 badblocks.

It’s dying.  Replace it now.

On a modern hard disk, you should *never* see bad sectors, because the drive is 
busy hiding all the bad sectors it does find, then telling you everything is 
fine.

Once the drive has swept so many problems under the rug that it is forced to 
admit to normal user space programs (e.g. badblocks) that there are bad 
sectors, it’s because the spare sector pool is full.  At that point, the only 
safe remediation is to replace the disk.

> Running smartctl after the last badblocks check I've noticed that 
> Current_Pending_Sector was 32 (not 92 as badblocks found).

SMART is allowed to lie to you.  That’s why there’s the RAW_VALUE column, yet 
there is no explanation in the manual as to what that value means.  The reason 
is that the low-level meanings of these values are defined by the drive 
manufacturers.  “92” is not necessarily a sector count.  For all you know, it 
is reporting that there are currently 92 lemmings in midair off the fjords of 
Finland.

The only important results here are:

a) the numbers are nonzero
b) the numbers are changing

That is all.  A zero value just means it hasn’t failed *yet*, and a static 
nonzero value means the drive has temporarily arrested its failures-in-progress.

There is no such thing as a hard drive with zero actual bad sectors, just one 
that has space left in its spare sector pool.  A “working” drive is one that is 
swapping sectors from the spare pool rarely enough that it is expected not to 
empty the pool before the warranty expires.

> Why does each consecutive run of badblocks report different results?

Because physics.  The highly competitive nature of the HDD business plus the 
relentless drive of Moore’s Business Law — as it should be called, since it is 
not a physical law, just an arbitrary fiction that the tech industry has bought 
into as the ground rules for the game — pushes the manufacturers to design them 
right up against the ragged edge of functionality.

HDD manufacturers could solve all of this by making them with 1/4 the capacity 
and twice the cost and get 10x the reliability.  And they do: they’re called 
SAS drives. :)

> Why does smartctl not update Reallocated_Event_Count?

Because SMART lies.

> What other tests can I perform to verify disk problems?

Quit poking the tiger to see if it will bite you.  Replace the bad disk and 
resilver that mirror before you lose the other disk, too.


Re: [CentOS] HDD badblocks

2016-01-19 Thread John R Pierce

On 1/19/2016 3:29 PM, J Martin Rushton wrote:

I suspect that the gold layer on edge connectors 30-odd years ago was
a lot thicker than on modern cards.  We are talking contacts on 0.1"
spacing not some modern 1/10 of a knat's whisker.  (Off topic) I also
remember seeing engineers determine which memory chip was at fault and
replacing the chip using a soldering iron.  Try that on a DIMM!


indeed, I pretty much quit doing component level electronics when 
everything went to surface mount.




--
john r pierce, recycling bits in santa cruz



Re: [CentOS] HDD badblocks

2016-01-19 Thread Valeri Galtsev

On Tue, January 19, 2016 4:48 pm, John R Pierce wrote:
> On 1/19/2016 2:24 PM, Warren Young wrote:
>> It’s dying.  Replace it now.
>
> agreed
>
>> On a modern hard disk, you should *never* see bad sectors, because the
>> drive is busy hiding all the bad sectors it does find, then telling you
>> everything is fine.
>
> that's not actually true.  The drive will report 'bad sector' if you
> try to read data that the drive simply can't read; you wouldn't want
> it to return bad data and say it's OK.  Many (most?) drives won't
> actually remap a bad sector until you write new data over that block
> number, since they don't want to copy bad data without any way of
> telling the OS the data is invalid.  These pending remaps are listed
> under SMART attribute 197, Current_Pending_Sector.
>

Apparently you know more about modern drives than I do, but as far as I
know it is a somewhat longer story when a bad block is discovered. Here it
is.

Basically, bad blocks are discovered on a read operation, when the CRC
(cyclic redundancy check) sum does not match. (In fact it is a bit more
sophisticated than just CRC, as modern high-data-density drives are trying
to match the analog signal from the read head against what was digitally
encoded on write.) When this discovery happens, the firmware decides this
is a bad block and adds a new location for it to the bad-block reallocation
table (a while ago, when I learned this, the reallocation table was located
in non-volatile memory on the drive controller board). Then the firmware
holds all other tasks and tries to recover the information stored in the
bad block. It re-reads the block and superimposes the read results until
the CRC matches, then writes the recovered data into the reallocated place;
or it gives up after some large number of attempts, writes whatever garbage
it ended up with into the reallocated place, and reports a fatal read
error. This attempt at recovering bad blocks very noticeably slows down I/O
on the device, so "freezing" on some I/O when accessing files may be an
indication of multiple bad blocks developing. Time to replace the drive.
The drive (even after an irrecoverable - fatal - read error) is still
considered usable; only when the bad-block reallocation table fills up does
the drive start reporting that it is "out of specs".

On a side note: even if the CRC matches, it doesn't ensure that the
recovered data is the same as the data originally written. This is why
filesystems that keep sophisticated checksums of files are getting popular
(ZFS, to name one).

Just my $0.02.

Valeri

>
>
> --
> john r pierce, recycling bits in santa cruz
>



Valeri Galtsev
Sr System Administrator
Department of Astronomy and Astrophysics
Kavli Institute for Cosmological Physics
University of Chicago
Phone: 773-702-4247



Re: [CentOS] HDD badblocks

2016-01-19 Thread Chris Murphy
On Tue, Jan 19, 2016 at 3:24 PM, Warren Young  wrote:

> On a modern hard disk, you should *never* see bad sectors, because the drive 
> is busy hiding all the bad sectors it does find, then telling you everything 
> is fine.

This is not a given. Misconfiguration can make persistent bad sectors
very common, and that misconfiguration is the default situation in
RAID setups on Linux, which is why it's so common. This, and user
error, are the top causes of RAID 5 implosion on Linux (both mdadm
and LVM raid). The necessary sequence:

1. The drive needs to know the sector is bad.
2. The drive needs to be asked to read that sector.
3. The drive needs to give up trying to read that sector.
4. The drive needs to report the sector LBA back to the OS.
5. The OS needs to write something back to that same LBA.
6. The drive will write to the sector, and if it fails, will remap the
LBA to a different (reserve) physical sector.

Where this fails on Linux is steps 3 and 4. By default, consumer drives
either don't support SCT ERC, as is the case in this thread, or have it
disabled. That means the timeout for deep recovery of bad sectors can be
very high, 2 or 3 minutes. Usually it's less than this, but often it's
more than the kernel's default SCSI command timer. When a command to the
drive doesn't complete successfully in the default 30 seconds, the kernel
resets the link to the drive, which obliterates the entire command queue
and the work the drive was doing to recover the bad sector. Therefore step
4 never happens, and no steps after it either.

Hence, bad sectors accumulate. And the consequence of this often
doesn't get figured out until a user looks at kernel messages and sees
a bunch of hard link resets and has a WTF moment, and asks questions.
More often they don't see those reset messages, or they don't ask
about them, so the next consequence is that a drive fails. When the failed
drive is other than the one with bad sectors, in effect there are two bad
strips per stripe during reads (including rebuild), and that's when there's
total array collapse even though only one drive actually failed. People use
RAID 6 to mask this problem, but it's still a misconfiguration that can
cause RAID 6 failures too.


>> Why does smartctl not update Reallocated_Event_Count?
>
> Because SMART lies.

Nope. The drive isn't being asked to write to those bad sectors. If it
can't successfully read the sector without error, it won't migrate the
data on its own (some drives never do this). So it necessitates a
write to the sector to cause the remap to happen.

The other thing is that the bad sector count on 512e AF drives is inflated.
The number of bad sectors is reported in 512-byte sector increments, but
there is no such thing on an AF drive: one bad physical sector will be
reported as 8 bad sectors. And fixing the problem requires writing exactly
those 8 logical sectors at one time, in a single command to the drive.
Ergo I've had 'dd if=/dev/zero of=/dev/sda seek=blah count=8' fail with a
read error, due to the command being internally reinterpreted as
read-modify-write. Ridiculous but true. So you have to use bs=4096 and
count=1, and of course adjust the seek LBA to be based on 4096 bytes
instead of 512.
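The 512e arithmetic above can be sketched as follows. The LBA value is made up for illustration, and /dev/sdX is a placeholder, so the dd command is printed rather than executed.

```shell
# Example 512-byte-unit LBA, as reported in a kernel read-error message.
LBA512=123456
# 8 logical 512-byte sectors per 4K physical sector; integer division
# also rounds down to the start of the physical sector.
LBA4K=$((LBA512 / 8))
# Printed, not executed: one aligned 4096-byte write covering the whole
# physical sector, so the drive doesn't read-modify-write.
echo "dd if=/dev/zero of=/dev/sdX bs=4096 seek=$LBA4K count=1"
```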

So the simplest fix here is:

echo 160 > /sys/block/sdX/device/timeout

That's needed for each member drive. Note this is not a persistent
setting. And then this:

echo repair > /sys/block/mdX/md/sync_action

That's once. You'll see the read errors in dmesg, and md writing back
to the drive with the bad sector.

This problem affects all software raid, including btrfs raid1. The
ideal scenario is you'll use 'smartctl -l scterc,70,70 /dev/sdX' in
startup script, so the drive fails reads on marginally bad sectors
with an error in 7 seconds maximum.
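The startup-script idea above can be sketched like this: try to cap the drive's error recovery at 7 seconds with SCT ERC, and for drives that don't support it, raise the kernel's SCSI command timer instead. The device list and the 180-second fallback are assumptions; DRY_RUN=1 (the default here) only prints the commands, since the real thing needs root and real drives.

```shell
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        printf 'would run: %s\n' "$*"
    else
        sh -c "$*"
    fi
}

for dev in sda sdb; do
    # 70 = 7.0 seconds (in tenths), for both read and write recovery.
    if ! run "smartctl -l scterc,70,70 /dev/$dev"; then
        # Fallback for drives without SCT ERC: give the kernel more
        # patience than the drive's worst-case recovery time.
        # Not persistent across reboots.
        run "echo 180 > /sys/block/$dev/device/timeout"
    fi
done
```

Note that SCT ERC settings are also not persistent on most drives, which is why this belongs in a startup script rather than being set once.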

The linux-raid@ list is chock full of this as a recurring theme.

-- 
Chris Murphy


Re: [CentOS] HDD badblocks

2016-01-19 Thread Alessandro Baggi

Il 18/01/2016 16:47, Matt Garman ha scritto:

That's strange, I expected the SMART test to show some issues.
Personally, I'm still not confident in that drive.  Can you check
cabling?  Another possibility is that there is a cable that has
vibrated into a marginal state.  Probably a long shot, but if it's
easy to get physical access to the machine, and you can afford the
downtime to shut it down, open up the chassis and re-seat the drive
and cables.

Every now and then I have PCIe cards that work fine for years, then
suddenly disappear after a reboot.  I re-seat them and they go back to
being fine for years.  So I believe vibration does sometimes play a
role in mysterious problems that creep up from time to time.



On Mon, Jan 18, 2016 at 5:39 AM, Alessandro Baggi
 wrote:

Il 18/01/2016 12:09, Chris Murphy ha scritto:


What is the result for each drive?

smartctl -l scterc 


Chris Murphy


SCT Error Recovery Control command not supported



This is a notebook.


Re: [CentOS] HDD badblocks

2016-01-19 Thread J Martin Rushton

I suspect that the gold layer on edge connectors 30-odd years ago was
a lot thicker than on modern cards.  We are talking contacts on 0.1"
spacing not some modern 1/10 of a knat's whisker.  (Off topic) I also
remember seeing engineers determine which memory chip was at fault and
replacing the chip using a soldering iron.  Try that on a DIMM!

On 19/01/16 00:39, Peter wrote:
> On 19/01/16 12:34, J Martin Rushton wrote:
>> Not new: I can remember seeing DEC engineers cleaning up the 
>> contacts on memory boards for a VAX 11/782 with a pencil eraser 
>> c.1985.  It's still a pretty standard first fix to reseat a card
>> or connector.
> 
> I used to do that as well.  The contacts would come out nice and
> shiny when you clean them.  Then I found out that what I was
> actually doing was removing the very thin layer of gold plating on
> the contacts and revealing the copper underneath.  That's why you
> should never clean contacts with a pencil eraser, just re-seat the
> boards and they'll make contact again.
> 
> 
> Peter ___ CentOS
> mailing list CentOS@centos.org 
> https://lists.centos.org/mailman/listinfo/centos
> 


Re: [CentOS] HDD badblocks

2016-01-19 Thread Valeri Galtsev

On Tue, January 19, 2016 5:29 pm, J Martin Rushton wrote:
> -BEGIN PGP SIGNED MESSAGE-
> Hash: SHA1
>
> I suspect that the gold layer on edge connectors 30-odd years ago was
> a lot thicker than on modern cards.

I remember, a long time ago - it actually was in a country "Far, Far
Away" ;-) - we were not allowed to dispose of connectors with gold-plated
contacts. These were collected, and the gold was extracted from them and
re-used. I believe they dissolved the base brass material with acid, then
just melted down the thin gold shells that were left. Not useful with
modern super-thin plating.

> We are talking contacts on 0.1"
> spacing not some modern 1/10 of a knat's whisker.  (Off topic) I also
> remember seeing engineers determine which memory chip was at fault and
> replacing the chip using a soldering iron.  Try that on a DIMM!
>
> On 19/01/16 00:39, Peter wrote:
>> On 19/01/16 12:34, J Martin Rushton wrote:
>>> Not new: I can remember seeing DEC engineers cleaning up the
>>> contacts on memory boards for a VAX 11/782 with a pencil eraser
>>> c.1985.  It's still a pretty standard first fix to reseat a card
>>> or connector.
>>
>> I used to do that as well.  The contacts would come out nice and
>> shiny when you clean them.  Then I found out that what I was
>> actually doing was removing the very thin layer of gold plating on
>> the contacts and revealing the copper underneath.  That's why you
>> should never clean contacts with a pencil eraser, just re-seat the
>> boards and they'll make contact again.
>>



Valeri Galtsev
Sr System Administrator
Department of Astronomy and Astrophysics
Kavli Institute for Cosmological Physics
University of Chicago
Phone: 773-702-4247



Re: [CentOS] HDD badblocks

2016-01-18 Thread Matt Garman
That's strange, I expected the SMART test to show some issues.
Personally, I'm still not confident in that drive.  Can you check
cabling?  Another possibility is that there is a cable that has
vibrated into a marginal state.  Probably a long shot, but if it's
easy to get physical access to the machine, and you can afford the
downtime to shut it down, open up the chassis and re-seat the drive
and cables.

Every now and then I have PCIe cards that work fine for years, then
suddenly disappear after a reboot.  I re-seat them and they go back to
being fine for years.  So I believe vibration does sometimes play a
role in mysterious problems that creep up from time to time.



On Mon, Jan 18, 2016 at 5:39 AM, Alessandro Baggi
 wrote:
> On 18/01/2016 12:09, Chris Murphy wrote:
>>
>> What is the result for each drive?
>>
>> smartctl -l scterc 
>>
>>
>> Chris Murphy
> SCT Error Recovery Control command not supported
>


Re: [CentOS] HDD badblocks

2016-01-18 Thread Alessandro Baggi

On 17/01/2016 19:36, Alessandro Baggi wrote:

On 17/01/2016 18:46, Brandon Vincent wrote:

On Sun, Jan 17, 2016 at 10:05 AM, Matt Garman
 wrote:

I'm not sure what's going on with your drive. But if it were mine, I'd
want to replace it. If there are issues, that long smart check ought to
turn up something, and in my experience, that's enough for a
manufacturer to do a warranty replacement.


I agree with Matt. Go ahead and run a few of the S.M.A.R.T. tests. I
can almost guarantee, based on your description of the problem, that
they will fail.

badblocks(8) is a very antiquated tool. Almost every hard drive has a
few bad sectors from the factory. Very old hard drives used to have a
list of the bad sectors printed on the front of the label. When you
first created a filesystem you had to enter all of the bad sectors
from the label so that the filesystem wouldn't store data there. Years
later, more bad sectors would form and you could enter them into the
filesystem by discovering them using a tool like badblocks(8).

Today, drives do all of this work automatically. The manufacturer of a
hard drive will scan the entire surface and write the bad sectors into
a section of the hard drive's electronics known as the P-list. The
controller on the drive will automatically remap these sectors to a
small area of spare sectors set aside for this very purpose. Later, if
more bad sectors form, the drive enters each one into a list known as
the G-list and remaps it to that same spare area.

Basically, under normal conditions the end user should NEVER see bad
sectors. If badblocks(8) is reporting bad sectors, it is very likely
that enough bad sectors have formed that the pool of reserved spare
sectors is exhausted. While in theory you could run badblocks(8) and
pass its output to the filesystem, I can assure you that the growth of
bad sectors at this point will continue.

I'd stop using that hard drive, pull any important data, and then
proceed to run S.M.A.R.T. tests so if the drive is under warranty you
can have it replaced.

Brandon Vincent


I'm running the long SMART test. I'll report the data when it finishes.


I've performed the smartctl test on sda. This is the result from 
smartctl -a /dev/sda:



smartctl 6.2 2013-07-26 r3841 [x86_64-linux-3.10.0-327.4.4.el7.x86_64] (local build)
Copyright (C) 2002-13, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Device Model:     WDC WD6400BEVT80A0RT0
Serial Number:    WD-WXF0AB9Y6939
LU WWN Device Id: 5 0014ee 0ac91c337
Firmware Version: 01.01A01
User Capacity:    640,135,028,736 bytes [640 GB]
Sector Size:      512 bytes logical/physical
Rotation Rate:    5400 rpm
Device is:        Not in smartctl database [for details use: -P showall]
ATA Version is:   ATA8-ACS (minor revision not indicated)
SATA Version is:  SATA 2.6, 3.0 Gb/s
Local Time is:    Mon Jan 18 09:42:01 2016 CET
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x00) Offline data collection activity
                                        was never started.
                                        Auto Offline Data Collection: Disabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                (15960) seconds.
Offline data collection
capabilities:                    (0x7b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:
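
Once the extended test completes, the verdict lands in the drive's
self-test log. Below is a sketch of pulling out the status and the
first-failing LBA; the log line is a hypothetical sample of
`smartctl -l selftest` output, not this drive's actual log.

```shell
#!/bin/sh
# Sketch: parse one line of a hypothetical `smartctl -l selftest /dev/sda`
# self-test log.  Columns are separated by runs of two or more spaces.
log='# 1  Extended offline    Completed: read failure       90%      1123         234567890'

status=$(printf '%s\n' "$log" | awk -F'  +' '{print $3}')   # third column: result
lba=$(printf '%s\n' "$log" | awk '{print $NF}')             # last field: LBA of first error
case $status in
  *"read failure"*) echo "long test failed: first bad LBA $lba" ;;
  *Completed*)      echo "long test passed" ;;
esac
```

With the sample line above this reports the failing LBA; a clean drive
would instead show "Completed without error" in that column.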

Re: [CentOS] HDD badblocks

2016-01-18 Thread Gordon Messmer

On 01/18/2016 07:47 AM, Matt Garman wrote:

Another possibility is that there is a cable that has
vibrated into a marginal state.


That wouldn't explain the SMART data reporting pending sectors.

According to spec, a drive may not reallocate sectors after a read error 
if it's later able to read the sector successfully.  That's probably 
what happened here.


Drives are consumable items in computing.  They have to be replaced 
eventually.  Read errors are often an early sign of failure.  The drive 
may continue to work for a while before it fails.  The only question is: 
is the value of whatever amount of time it has left greater than the 
cost of replacing it?



Re: [CentOS] HDD badblocks

2016-01-18 Thread Alessandro Baggi

On 18/01/2016 12:09, Chris Murphy wrote:

What is the result for each drive?

smartctl -l scterc 


Chris Murphy


SCT Error Recovery Control command not supported


Re: [CentOS] HDD badblocks

2016-01-18 Thread Chris Murphy
What is the result for each drive?

smartctl -l scterc 


Chris Murphy
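
For md RAID the point of `smartctl -l scterc` is the drive's
error-recovery timeout versus the kernel's SCSI command timeout: a
desktop drive can retry a bad sector far longer than the kernel waits,
getting the whole member kicked from the array. A minimal sketch of
acting on that output follows; the sample strings are stand-ins for
real smartctl output, and the device names are hypothetical.

```shell
#!/bin/sh
# Sketch: decide whether a drive honours SCT Error Recovery Control based on
# the text `smartctl -l scterc <dev>` prints.  On a live system one would run
#   smartctl -l scterc,70,70 /dev/sda   # request a 7.0 s read/write timeout
supported='SCT Error Recovery Control:
           Read: 70 (7.0 seconds)
          Write: 70 (7.0 seconds)'
unsupported='SCT Error Recovery Control command not supported'

erc_ok() {
    # returns 0 when the output shows a configurable ERC timeout
    ! printf '%s\n' "$1" | grep -q 'not supported'
}

erc_ok "$supported"   && echo "drive 1: ERC configurable"
erc_ok "$unsupported" || echo "drive 2: raise the kernel SCSI timeout instead"
```

When ERC is not supported (as reported later in this thread), the
commonly suggested workaround is raising the kernel's per-device
timeout instead, e.g. echo 180 > /sys/block/sda/device/timeout.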


Re: [CentOS] HDD badblocks

2016-01-18 Thread Chris Murphy
Also useful, complete dmesg posted somewhere (unless your MUA can be set to
not wrap lines)

Chris Murphy


Re: [CentOS] HDD badblocks

2016-01-18 Thread J Martin Rushton

Not new: I can remember seeing DEC engineers cleaning up the contacts
on memory boards for a VAX 11/782 with a pencil eraser c.1985.  It's
still a pretty standard first fix to reseat a card or connector.

On 18/01/16 15:47, Matt Garman wrote:
> That's strange, I expected the SMART test to show some issues. 
> Personally, I'm still not confident in that drive.  Can you check 
> cabling?  Another possibility is that there is a cable that has 
> vibrated into a marginal state.  Probably a long shot, but if it's 
> easy to get physical access to the machine, and you can afford the 
> downtime to shut it down, open up the chassis and re-seat the
> drive and cables.
> 
> Every now and then I have PCIe cards that work fine for years,
> then suddenly disappear after a reboot.  I re-seat them and they go
> back to being fine for years.  So I believe vibration does
> sometimes play a role in mysterious problems that creep up from
> time to time.
> 
> 
> 
> On Mon, Jan 18, 2016 at 5:39 AM, Alessandro Baggi 
>  wrote:
>> On 18/01/2016 12:09, Chris Murphy wrote:
>>> 
>>> What is the result for each drive?
>>> 
>>> smartctl -l scterc 
>>> 
>>> 
>>> Chris Murphy
>>>
>> SCT Error Recovery Control command not supported
>>
> 


Re: [CentOS] HDD badblocks

2016-01-18 Thread Peter

On 19/01/16 12:34, J Martin Rushton wrote:
> Not new: I can remember seeing DEC engineers cleaning up the
> contacts on memory boards for a VAX 11/782 with a pencil eraser
> c.1985.  It's still a pretty standard first fix to reseat a card or
> connector.

I used to do that as well.  The contacts would come out nice and shiny
when you cleaned them.  Then I found out that what I was actually doing
was removing the very thin layer of gold plating on the contacts and
revealing the copper underneath.  That's why you should never clean
contacts with a pencil eraser, just re-seat the boards and they'll
make contact again.


Peter


Re: [CentOS] HDD badblocks

2016-01-17 Thread Alessandro Baggi

On 17/01/2016 18:46, Brandon Vincent wrote:

On Sun, Jan 17, 2016 at 10:05 AM, Matt Garman  wrote:

I'm not sure what's going on with your drive. But if it were mine, I'd want
to replace it. If there are issues, that long smart check ought to turn up
something,  and in my experience, that's enough for a manufacturer to do a
warranty replacement.


I agree with Matt. Go ahead and run a few of the S.M.A.R.T. tests. I
can almost guarantee, based on your description of the problem, that
they will fail.

badblocks(8) is a very antiquated tool. Almost every hard drive has a
few bad sectors from the factory. Very old hard drives used to have a
list of the bad sectors printed on the front of the label. When you
first created a filesystem you had to enter all of the bad sectors
from the label so that the filesystem wouldn't store data there. Years
later, more bad sectors would form and you could enter them into the
filesystem by discovering them using a tool like badblocks(8).

Today, drives do all of this work automatically. The manufacturer of a
hard drive will scan the entire surface and write the bad sectors into
a section of the hard drive's electronics known as the P-list. The
controller on the drive will automatically remap these sectors to a
small area of spare sectors set aside for this very purpose. Later, if
more bad sectors form, the drive enters each one into a list known as
the G-list and remaps it to that same spare area.

Basically, under normal conditions the end user should NEVER see bad
sectors. If badblocks(8) is reporting bad sectors, it is very likely
that enough bad sectors have formed that the pool of reserved spare
sectors is exhausted. While in theory you could run badblocks(8) and
pass its output to the filesystem, I can assure you that the growth of
bad sectors at this point will continue.

I'd stop using that hard drive, pull any important data, and then
proceed to run S.M.A.R.T. tests so if the drive is under warranty you
can have it replaced.

Brandon Vincent


I'm running the long SMART test. I'll report the data when it finishes.


[CentOS] HDD badblocks

2016-01-17 Thread Alessandro Baggi

Hi list,
I have a notebook running C7 (1511). This notebook has 2 disks (640 GB 
each) configured with MD at RAID level 1. A few days ago I noticed some 
critical slowdowns while opening applications.


First of all, I disabled ACPI on the disks.


I checked both sda and sdb for bad blocks 4 consecutive times and 
noticed some strange behaviour.


On sdb there are no problems, but on sda:

1) The first run of badblocks reports 28 bad blocks on the disk
2) The second run reports 32 bad blocks
3) The third reports 102 bad blocks
4) The last run reports 92 bad blocks.


Running smartctl after the last badblocks check, I noticed that 
Current_Pending_Sector was 32 (not 92, as badblocks found).


To force sector reallocation I filled the disk to 100%, ran badblocks 
again, and 0 bad blocks were found.
Running smartctl again, Current_Pending_Sector is 0, but 
Reallocated_Event_Count is also 0.


Why does each consecutive run of badblocks report different results?
Why does smartctl not update Reallocated_Event_Count?
The bad blocks found on sda increase/decrease without a clear reason. 
Could this behaviour be related to the RAID (if a disk has bad blocks, 
can they be replicated on the second disk)?


What other tests can I perform to verify disk problems?

Thanks in advance.
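
One concrete way to chase these numbers is to watch the raw SMART
counters before and after each badblocks pass, on each RAID member
separately. The sketch below only parses sample `smartctl -A` text;
the device name and the raw values are hypothetical, and on a live
system the input would come from `smartctl -A /dev/sda` and
`smartctl -A /dev/sdb`.

```shell
#!/bin/sh
# Sketch: extract the three SMART counters this thread cares about from
# (hypothetical) `smartctl -A` output.  Run per RAID member, before and
# after each badblocks pass, and compare the raw values.
sample='197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       32
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0'

printf '%s\n' "$sample" | awk '
/Current_Pending_Sector|Reallocated_Sector_Ct|Reallocated_Event_Count/ {
    # attribute name is field 2, raw value is the last field
    print $2 "=" $NF
}'
```

A pending sector that later reads (or is overwritten) cleanly is simply
dropped from Current_Pending_Sector without the reallocation counters
moving, which is consistent with the behaviour described above.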


Re: [CentOS] HDD badblocks

2016-01-17 Thread Matt Garman
Have you run a "long" SMART test on the drive?  smartctl -t long <device>

I'm not sure what's going on with your drive. But if it were mine, I'd want
to replace it. If there are issues, that long SMART check ought to turn up
something, and in my experience that's enough for a manufacturer to do a
warranty replacement.
On Jan 17, 2016 11:00, "Alessandro Baggi" 
wrote:

> Hi list,
> I have a notebook running C7 (1511). This notebook has 2 disks (640 GB
> each) configured with MD at RAID level 1. A few days ago I noticed some
> critical slowdowns while opening applications.
>
> First of all, I disabled ACPI on the disks.
>
>
> I checked both sda and sdb for bad blocks 4 consecutive times and
> noticed some strange behaviour.
>
> On sdb there are no problems, but on sda:
>
> 1) The first run of badblocks reports 28 bad blocks on the disk
> 2) The second run reports 32 bad blocks
> 3) The third reports 102 bad blocks
> 4) The last run reports 92 bad blocks.
>
>
> Running smartctl after the last badblocks check, I noticed that
> Current_Pending_Sector was 32 (not 92, as badblocks found).
>
> To force sector reallocation I filled the disk to 100%, ran badblocks
> again, and 0 bad blocks were found.
> Running smartctl again, Current_Pending_Sector is 0, but
> Reallocated_Event_Count is also 0.
>
> Why does each consecutive run of badblocks report different results?
> Why does smartctl not update Reallocated_Event_Count?
> The bad blocks found on sda increase/decrease without a clear reason.
> Could this behaviour be related to the RAID (if a disk has bad blocks,
> can they be replicated on the second disk)?
>
> What other tests can I perform to verify disk problems?
>
> Thanks in advance.
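
The long test runs entirely inside the drive, so it can be polled
without disturbing it. A sketch follows; the real-hardware commands
appear only as comments, and the status strings are hypothetical
`smartctl -c` excerpts so the check itself is runnable.

```shell
#!/bin/sh
# Sketch: poll a long self-test for completion.  On real hardware:
#   smartctl -t long /dev/sda    # start; the test runs inside the drive
#   smartctl -c /dev/sda         # "Self-test execution status" shows progress
busy='Self-test routine in progress... 70% of test remaining.'
idle='The previous self-test routine completed without error or no self-test has ever been run.'

in_progress() { printf '%s\n' "$1" | grep -q 'in progress'; }

in_progress "$busy" && echo "still running - check back later"
in_progress "$idle" || echo "finished - inspect the log: smartctl -l selftest /dev/sda"
```

The device name and the percentages are placeholders; only the
"in progress" phrasing is what the check keys on.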


Re: [CentOS] HDD badblocks

2016-01-17 Thread Brandon Vincent
On Sun, Jan 17, 2016 at 10:05 AM, Matt Garman  wrote:
> I'm not sure what's going on with your drive. But if it were mine, I'd want
> to replace it. If there are issues, that long smart check ought to turn up
> something,  and in my experience, that's enough for a manufacturer to do a
> warranty replacement.

I agree with Matt. Go ahead and run a few of the S.M.A.R.T. tests. I
can almost guarantee, based on your description of the problem, that
they will fail.

badblocks(8) is a very antiquated tool. Almost every hard drive has a
few bad sectors from the factory. Very old hard drives used to have a
list of the bad sectors printed on the front of the label. When you
first created a filesystem you had to enter all of the bad sectors
from the label so that the filesystem wouldn't store data there. Years
later, more bad sectors would form and you could enter them into the
filesystem by discovering them using a tool like badblocks(8).

Today, drives do all of this work automatically. The manufacturer of a
hard drive will scan the entire surface and write the bad sectors into
a section of the hard drive's electronics known as the P-list. The
controller on the drive will automatically remap these sectors to a
small area of spare sectors set aside for this very purpose. Later, if
more bad sectors form, the drive enters each one into a list known as
the G-list and remaps it to that same spare area.

Basically, under normal conditions the end user should NEVER see bad
sectors. If badblocks(8) is reporting bad sectors, it is very likely
that enough bad sectors have formed that the pool of reserved spare
sectors is exhausted. While in theory you could run badblocks(8) and
pass its output to the filesystem, I can assure you that the growth of
bad sectors at this point will continue.

I'd stop using that hard drive, pull any important data, and then
proceed to run S.M.A.R.T. tests so if the drive is under warranty you
can have it replaced.

Brandon Vincent