Re: smartctl cannot access my storage, need syntax help

gene heskett Sun, 14 Jan 2024 11:49:58 -0800

On 1/14/24 07:42, David Christensen wrote:

Re-ordered for clarity -- David.

And snipped by Gene as I updated

On 1/12/24 18:42, gene heskett wrote:
I just found an mbox file in my home directory, containing about 90 days' worth of undelivered msgs from smartctl running as root.
Do you know how the mbox file got there?

No, it just appeared.

smartctl says my raid10 is dying, ...



Please post a console session with a command that displays the message.

This is a copy/paste of the second message in that file, the first from smartctl, followed by the last message in that file:


From [email protected] Wed Nov 02 00:29:05 2022
Return-path: <[email protected]>
Envelope-to: [email protected]
Delivery-date: Wed, 02 Nov 2022 00:29:05 -0400
Received: from root by coyote.coyote.den with local (Exim 4.94.2)
        (envelope-from <[email protected]>)
        id 1oq5NB-000DBx-15
        for [email protected]; Wed, 02 Nov 2022 00:29:05 -0400
To: [email protected]
Subject: SMART error (SelfTest) detected on host: coyote
MIME-Version: 1.0
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: 8bit
Message-Id: <[email protected]>
From: root <[email protected]>
Date: Wed, 02 Nov 2022 00:29:05 -0400
Content-Length: 513
Lines: 16
Status: RO
X-Status:
X-Keywords:
X-UID: 2

This message was generated by the smartd daemon running on:

   host name:  coyote
   DNS domain: coyote.den

The following warning/error was logged by the smartd daemon:

Device: /dev/sde [SAT], Self-Test Log error count increased from 0 to 1

Device info:

Samsung SSD 870 EVO 1TB, S/N:S626NF0R302507V, WWN:5-002538-f413394ae, FW:SVT01B6Q, 1.00 TB


For details see host's SYSLOG.

You can also use the smartctl utility for further investigation.
Another message will be sent in 24 hours if the problem persists.

======= 3 more identical msgs referring to the other 3 drives in the raid. =====

From [email protected] Wed Nov 16 06:22:02 2022
Return-path: <[email protected]>
Envelope-to: [email protected]
Delivery-date: Wed, 16 Nov 2022 06:22:02 -0500
Received: from root by coyote.coyote.den with local (Exim 4.94.2)
        (envelope-from <[email protected]>)
        id 1ovGUR-0000De-Bc
        for [email protected]; Wed, 16 Nov 2022 06:21:59 -0500
To: [email protected]
Subject: SMART error (SelfTest) detected on host: coyote
MIME-Version: 1.0
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: 8bit
Message-Id: <[email protected]>
From: root <[email protected]>
Date: Wed, 16 Nov 2022 06:21:59 -0500
Content-Length: 592
Lines: 17
Status: RO
X-Status:
X-Keywords:
X-UID: 9

This message was generated by the smartd daemon running on:

   host name:  coyote
   DNS domain: coyote.den

The following warning/error was logged by the smartd daemon:

Device: /dev/sdd [SAT], Self-Test Log error count increased from 1 to 2

Device info:

Samsung SSD 870 EVO 1TB, S/N:S626NF0R302502E, WWN:5-002538-f413394a9, FW:SVT01B6Q, 1.00 TB


For details see host's SYSLOG.

You can also use the smartctl utility for further investigation.

The original message about this issue was sent at Wed Nov 2 06:59:04 2022 EDT

Another message will be sent in 24 hours if the problem persists.

I also note they are now very old messages, but the file itself is dated Jan 7th. And syslog has been rotated several times since.

I'm no expert at interpreting smartctl reports, but I do not see such in the smartctl output now. Going backwards thru the list, the 4th drive in the raid has had 3334 errors, as has the third drive with 3332 errors; the 1st and 2nd are clean.


One stanza of the error report:

Error 3328 occurred at disk power-on lifetime: 21027 hours (876 days + 3 hours)
  When the command that caused the error occurred, the device was active or idle.


  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 28 00 54 a9 40  Error: UNC at LBA = 0x00a95400 = 11097088

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  60 00 28 00 54 a9 40 05      15:16:34.891  READ FPDMA QUEUED
  61 18 18 e8 ea 67 40 03      15:16:34.891  WRITE FPDMA QUEUED
  60 00 10 00 5e a9 40 02      15:16:34.891  READ FPDMA QUEUED
  60 28 08 00 f4 87 40 01      15:16:34.891  READ FPDMA QUEUED
  60 00 00 00 7c a9 40 00      15:16:34.891  READ FPDMA QUEUED
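
For anyone else who finds these register dumps hard to read, here is a minimal Python sketch of how the "LBA = 0x00a95400" figure can be recovered from the register row above. It assumes the summary error log reports the legacy 28-bit task-file registers (SN, CL, CH and the low nibble of DH) and that bit 6 of the ER byte is the UNC (uncorrectable data) flag. The function name is made up for illustration and is not part of smartmontools.

```python
def lba28_from_registers(sn: int, cl: int, ch: int, dh: int) -> int:
    """Combine ATA task-file registers into a 28-bit LBA.

    Assumed layout: SN holds LBA bits 0-7, CL bits 8-15, CH bits 16-23,
    and the low nibble of DH holds bits 24-27 (the high nibble is flags).
    """
    return ((dh & 0x0F) << 24) | (ch << 16) | (cl << 8) | sn


# Register bytes copied from the "After command completion" row above.
er, st, sc, sn, cl, ch, dh = (int(b, 16) for b in "40 51 28 00 54 a9 40".split())

lba = lba28_from_registers(sn, cl, ch, dh)
is_unc = bool(er & 0x40)  # bit 6 of the error register: uncorrectable data

print(f"UNC flag set: {is_unc}")
print(f"LBA = 0x{lba:08x} = {lba}")
```

The result agrees with the value smartctl printed on that row, so the stanza describes an unreadable sector at that address, not a cabling or controller fault.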

SMART Self-test log structure revision number 1

Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed: read failure       50%     10917         1847474376
# 2  Extended offline    Completed: read failure       50%     10586         1847474376

So half the Samsung 870's are on their way out. But nothing recent... So I am now trying to get a good rsync copy on another drive.


On 1/12/24 20:57, gene heskett wrote:
 > ... there are 4 1t drives as a raid10, and the
 > various messages in that mbox file name all of the individual drives.


Please post a representative sample of the messages.


See above, most of it is swahili to me.

 > Then I find the linux has played 52 pickup with the device names.
/dev/sd* device node names are unpredictable. The traditional solution is UUID's. Linux added /dev/disk/by-id/* a while ago and I am starting to use them as much as possible. Make sure you look very carefully at the serial numbers when you have several drives of the same make and model.
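
As a rough illustration of the by-id suggestion, the following Python sketch lists which stable names currently point at each kernel device node. It only reads symlinks, so it needs no root access. The function name and its directory parameter are assumptions made for this example; on a system without /dev/disk/by-id it simply returns an empty mapping.

```python
import os


def stable_names(by_id_dir: str = "/dev/disk/by-id") -> dict[str, list[str]]:
    """Map each kernel device node (e.g. /dev/sdd) to its by-id symlink names."""
    mapping: dict[str, list[str]] = {}
    if not os.path.isdir(by_id_dir):
        return mapping
    for name in sorted(os.listdir(by_id_dir)):
        link = os.path.join(by_id_dir, name)
        if not os.path.islink(link):
            continue
        # Resolve the symlink to the node the kernel assigned on this boot.
        target = os.path.realpath(link)
        mapping.setdefault(target, []).append(name)
    return mapping


if __name__ == "__main__":
    for node, names in sorted(stable_names().items()):
        print(node)
        for name in names:
            print("   ", name)
```

The by-id names normally embed the model and serial number, so the same physical drive keeps the same name even when its /dev/sd* letter changes between boots or controllers.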
 > There are in actual fact 3 sata controllers in this machine, the
 > motherboard's 6 ports, 6 more on an inexpensive sata controller that are
 > actually the 4 raid10 Samsung 870 1T drives, and 4 more on a more
 > expensive 16 port card which has a quartet of 2T gigastone SSD's on it,
 > but the drives are not found in the order of the controllers. That
 > raid10 was composed w/o the third controller.


So:

* /home is on a RAID 10 with 2 @ mirror of 2 @ 1 TB Samsung 870 SSD?

I think that's what you call a raid10.

* 4 @ 2 TB Gigastone SSD for a new RAID 10?


just installed, not mounted or made into a raid yet. WIP?


What drives are connected to which ports?

4 Samsung 870 1T's are on the 1st added controller.
ATM, 5 2T gigastones are on the 2nd added controller, the 16 port one.
smartctl says all 5 of those are fine.



What is on the other 20 ports?

On the mobo? A big dvd writer and 2 other half T or 1T Samsung drives from earlier 860 runs, not currently mounted. No spinning rust anyplace now. I don't appreciate being a lab rat for seagate to experiment on.

A current lsblk:
gene@coyote:~$ lsblk
NAME        MAJ:MIN RM   SIZE RO TYPE   MOUNTPOINTS
sda           8:0    0 931.5G  0 disk
├─sda1        8:1    0 838.2G  0 part   /
├─sda2        8:2    0  46.8G  0 part   [SWAP]
└─sda3        8:3    0  46.6G  0 part   /tmp

sdb           8:16   1     0B  0 disk   is probably my camera, currently plugged in

sdc           8:32   1     0B  0 disk   is probably my Brother MFP-J6920DW printer, always plugged in

first controller, 6 port
sdd           8:48   0 931.5G  0 disk
├─sdd1        8:49   0   900G  0 part
│ └─md0       9:0    0   1.7T  0 raid10
│   └─md0p1 259:0    0   1.7T  0 part   /home
├─sdd2        8:50   0    30G  0 part
│ └─md1       9:1    0    60G  0 raid10 [SWAP]
└─sdd3        8:51   0   1.5G  0 part
  └─md2       9:2    0     3G  0 raid10
sde           8:64   0 931.5G  0 disk
├─sde1        8:65   0   900G  0 part
│ └─md0       9:0    0   1.7T  0 raid10
│   └─md0p1 259:0    0   1.7T  0 part   /home
├─sde2        8:66   0    30G  0 part
│ └─md1       9:1    0    60G  0 raid10 [SWAP]
└─sde3        8:67   0   1.5G  0 part
  └─md2       9:2    0     3G  0 raid10
sdf           8:80   0 931.5G  0 disk
├─sdf1        8:81   0   900G  0 part
│ └─md0       9:0    0   1.7T  0 raid10
│   └─md0p1 259:0    0   1.7T  0 part   /home
├─sdf2        8:82   0    30G  0 part
│ └─md1       9:1    0    60G  0 raid10 [SWAP]
└─sdf3        8:83   0   1.5G  0 part
  └─md2       9:2    0     3G  0 raid10
sdg           8:96   0 931.5G  0 disk
├─sdg1        8:97   0   900G  0 part
│ └─md0       9:0    0   1.7T  0 raid10
│   └─md0p1 259:0    0   1.7T  0 part   /home
├─sdg2        8:98   0    30G  0 part
│ └─md1       9:1    0    60G  0 raid10 [SWAP]
└─sdg3        8:99   0   1.5G  0 part
  └─md2       9:2    0     3G  0 raid10

2nd controller, 16 ports, all 5 2T gigastone's
sdh           8:112  0   1.9T  0 disk
└─sdh1        8:113  0   1.9T  0 part
sdi           8:128  0   1.9T  0 disk
└─sdi1        8:129  0   1.9T  0 part
sdj           8:144  0   1.9T  0 disk
└─sdj1        8:145  0   1.9T  0 part
sdk           8:160  0   1.9T  0 disk
└─sdk1        8:161  0   1.9T  0 part
sdl           8:176  0   1.9T  0 disk
└─sdl1        8:177  0   1.9T  0 part
sr0          11:0    1  1024M  0 rom  The internal dvd writer
gene@coyote:~$



 > blkid does not sort them in order either. And of course does not list
 > what's unmounted, forcing me to ident the drive by gparted in order to
 > get its device name. From that I might be able to construct another raid
 > from the 8T of 4 2T drives but it's confusing as hell when the first of
 > those 2T drives is assigned /dev/sde and the next 4 on the new
 > controller are /dev/sdi, j, k, & l.
 > So it appears I have 5 of those gigastones, and sde is the odd one

Which when it was /dev/sde1, was plugged into the 1st extra controller

When the data cable was plugged into a motherboard port, it became /dev/sdb1. So I've relabeled it, and am about to test it on the second, 16 port controller.



I am confused -- do you have 4 or 5 Gigastone 2 TB SSD?


5,  ordered in 2 separate orders.


 > So that one could be formatted ext4 and serve as a backup of the raid10.

What I am trying to do now, but cannot if it is plugged into a motherboard port, hence the repeat of this exercise on the 2nd sata card.


 > how do I make an image of that
 > raid10  to /dev/sde and get every byte?  That seems like the first step
 > to me.

This I am still trying to do. The first pass copied all 350G of /home but went to the wrong drive, even though I had mounted the drive by its label.

It is now /dev/sdh and all labels above it are now wrong. Crazy.

These SSD's all have an OTP serial number. I am tempted to use that serial number as a label _I_ can control. And according to gparted, labels do not survive being incorporated into a raid, as the raid is all labeled with hostname : partition number. So there really is no way in linux to define a drive that is that drive forever. Unreal...

Please get a USB 3.x HDD, do a full backup of your entire computer, put it off-site, get another USB 3.x HDD, do another full backup, and keep it nearby.

That, using amanda, is the end target of this. But I have bought 3 such spinning rust drives over the years and not had any survive being hot plugged into a usb port more than twice.


With that track record, I'll not waste any more money down that rabbit hole.


 >   But since I can't copy a locked file,


What file is locked?  Please post a console session that demonstrates.

A file that is opened but not closed is exclusive to that app and its lock, and cannot be copied except by rsync, or so I have been told. And there are quite a few such open locks on this system right now. This killed my full housed amiga when the boot drive with all its custom scripts died, and I found the backups I had were totally devoid of any of those scripts. I still have about 20 QIC tapes from that machine, but now no drives to read them. I need to cull the midden heap.
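
One way to check the locked-file worry on Linux is the small Python sketch below. It assumes the locks in question are ordinary advisory flock() locks, which only affect programs that also ask for the lock. The file names are invented for the test and live in a temporary directory. It does not show that a copy of a file being actively written is consistent, only that an open, locked file can still be read and copied.

```python
import fcntl
import os
import shutil
import tempfile


def copy_while_locked() -> bytes:
    """Copy a file while another handle holds it open with an exclusive flock()."""
    with tempfile.TemporaryDirectory() as tmp:
        src = os.path.join(tmp, "busy.txt")
        dst = os.path.join(tmp, "copy.txt")
        with open(src, "w+b") as holder:
            holder.write(b"still open and locked\n")
            holder.flush()
            # Take an exclusive advisory lock, as a cooperating app might.
            fcntl.flock(holder, fcntl.LOCK_EX)
            # shutil.copy2 never requests the lock, so the read is not blocked.
            shutil.copy2(src, dst)
        with open(dst, "rb") as copied:
            return copied.read()


print(copy_while_locked())
```

If this prints the file contents, then plain copy tools are not stopped by such locks on this system, and the remaining concern is only whether a file changes while it is being copied.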


 > /dev/sde1 has been formatted and mounted, what cmd line will copy every
 > byte including locked files in that raid10 to it?


See above for locked.  Otherwise, I suggest rsync(1).

[...]
Thank you David.

Cheers, Gene Heskett.
--
"There are four boxes to be used in defense of liberty:
 soap, ballot, jury, and ammo. Please use in that order."
-Ed Howdershelt (Author, 1940)
If we desire respect for the law, we must first make the law respectable.
 - Louis D. Brandeis
