On 1/14/24 07:42, David Christensen wrote:
Re-ordered for clarity -- David.
And snipped by Gene as I updated

On 1/12/24 18:42, gene heskett wrote:
I just found an mbox file in my home directory, containing about 90 days' worth of undelivered msgs from smartctl running as root.


Do you know how the mbox file got there?
No, it just appeared.


smartctl says my raid10 is dying, ...


Please post a console session with a command that displays the message.
This is a copy/paste of the second message in that file, the first from smartctl, followed by the last message in that file:

From r...@coyote.coyote.den Wed Nov 02 00:29:05 2022
Return-path: <r...@coyote.coyote.den>
Envelope-to: r...@coyote.coyote.den
Delivery-date: Wed, 02 Nov 2022 00:29:05 -0400
Received: from root by coyote.coyote.den with local (Exim 4.94.2)
        (envelope-from <r...@coyote.coyote.den>)
        id 1oq5NB-000DBx-15
        for r...@coyote.coyote.den; Wed, 02 Nov 2022 00:29:05 -0400
To: r...@coyote.coyote.den
Subject: SMART error (SelfTest) detected on host: coyote
MIME-Version: 1.0
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: 8bit
Message-Id: <e1oq5nb-000dbx...@coyote.coyote.den>
From: root <r...@coyote.coyote.den>
Date: Wed, 02 Nov 2022 00:29:05 -0400
Content-Length: 513
Lines: 16
Status: RO
X-Status:
X-Keywords:
X-UID: 2

This message was generated by the smartd daemon running on:

   host name:  coyote
   DNS domain: coyote.den

The following warning/error was logged by the smartd daemon:

Device: /dev/sde [SAT], Self-Test Log error count increased from 0 to 1

Device info:
Samsung SSD 870 EVO 1TB, S/N:S626NF0R302507V, WWN:5-002538-f413394ae, FW:SVT01B6Q, 1.00 TB

For details see host's SYSLOG.

You can also use the smartctl utility for further investigation.
Another message will be sent in 24 hours if the problem persists.
======= 3 more identical msgs referring to the other 3 drives in the raid. =====
From r...@coyote.coyote.den Wed Nov 16 06:22:02 2022
Return-path: <r...@coyote.coyote.den>
Envelope-to: r...@coyote.coyote.den
Delivery-date: Wed, 16 Nov 2022 06:22:02 -0500
Received: from root by coyote.coyote.den with local (Exim 4.94.2)
        (envelope-from <r...@coyote.coyote.den>)
        id 1ovGUR-0000De-Bc
        for r...@coyote.coyote.den; Wed, 16 Nov 2022 06:21:59 -0500
To: r...@coyote.coyote.den
Subject: SMART error (SelfTest) detected on host: coyote
MIME-Version: 1.0
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: 8bit
Message-Id: <e1ovgur-0000de...@coyote.coyote.den>
From: root <r...@coyote.coyote.den>
Date: Wed, 16 Nov 2022 06:21:59 -0500
Content-Length: 592
Lines: 17
Status: RO
X-Status:
X-Keywords:
X-UID: 9

This message was generated by the smartd daemon running on:

   host name:  coyote
   DNS domain: coyote.den

The following warning/error was logged by the smartd daemon:

Device: /dev/sdd [SAT], Self-Test Log error count increased from 1 to 2

Device info:
Samsung SSD 870 EVO 1TB, S/N:S626NF0R302502E, WWN:5-002538-f413394a9, FW:SVT01B6Q, 1.00 TB

For details see host's SYSLOG.

You can also use the smartctl utility for further investigation.
The original message about this issue was sent at Wed Nov 2 06:59:04 2022 EDT
Another message will be sent in 24 hours if the problem persists.

I also note they are now very old messages, but the file itself is dated Jan 7th. And syslog has been rotated several times since.

I'm no expert at interpreting smartctl reports, but I do not see such in the smartctl output now. Going backwards through the list, the 4th drive in the raid has had 3334 errors and the third drive 3332 errors; the 1st and 2nd are clean.

One stanza of the error report:
Error 3328 occurred at disk power-on lifetime: 21027 hours (876 days + 3 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 28 00 54 a9 40  Error: UNC at LBA = 0x00a95400 = 11097088

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  60 00 28 00 54 a9 40 05      15:16:34.891  READ FPDMA QUEUED
  61 18 18 e8 ea 67 40 03      15:16:34.891  WRITE FPDMA QUEUED
  60 00 10 00 5e a9 40 02      15:16:34.891  READ FPDMA QUEUED
  60 28 08 00 f4 87 40 01      15:16:34.891  READ FPDMA QUEUED
  60 00 00 00 7c a9 40 00      15:16:34.891  READ FPDMA QUEUED

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed: read failure       50%     10917         1847474376
# 2  Extended offline    Completed: read failure       50%     10586         1847474376
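
For anyone else who finds the register dump hard to read: in the "Error 3328" stanza above, the LBA that smartctl prints can be rebuilt from the SN, CL, CH and DH bytes. The following is a minimal Python sketch, assuming the classic 28-bit register layout (the function name is made up for illustration and is not part of smartmontools):

```python
def lba28_from_registers(sn: int, cl: int, ch: int, dh: int) -> int:
    """Combine ATA task-file registers into a 28-bit LBA.

    SN holds LBA bits 0-7, CL bits 8-15, CH bits 16-23, and the low
    nibble of DH holds bits 24-27 (the high nibble carries flag bits).
    """
    return ((dh & 0x0F) << 24) | (ch << 16) | (cl << 8) | sn


# Register bytes copied from the "Error 3328" stanza: SN CL CH DH = 00 54 a9 40
lba = lba28_from_registers(0x00, 0x54, 0xA9, 0x40)
print(hex(lba), lba)  # prints: 0xa95400 11097088
```

The result agrees with the "UNC at LBA = 0x00a95400 = 11097088" line, i.e. the drive reported an uncorrectable read at that sector.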

So half the samsung 870's are on their way out. But nothing recent... So I am now trying to get a good rsync copy on another drive.

On 1/12/24 20:57, gene heskett wrote:
 > ... there are 4 1t drives as a raid10, and the
 > various messages in that mbox file name all of the individual drives.


Please post a representative sample of the messages.

See above, most of it is swahili to me.

 > Then I find the linux has played 52 pickup with the device names.


/dev/sd* device node names are unpredictable.  The traditional solution is UUID's.  Linux added /dev/disk/by-id/* a while ago and I am starting to use them as much as possible.  Make sure you look very carefully at the serial numbers when you have several drives of the same make and model.
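
As a rough illustration of the by-id idea, here is a small Python sketch that lists which kernel device node each persistent name points to on the current boot. It assumes a udev-managed /dev/disk/by-id directory whose entries are symlinks (names there typically embed the model and serial number); the function name is hypothetical.

```python
import os
from pathlib import Path


def map_by_id(by_id_dir: str = "/dev/disk/by-id") -> dict:
    """Return {persistent name: current device node} for whole disks.

    Each entry in by_id_dir is a symlink to whichever /dev/sdX node the
    kernel assigned on this boot; partition links ("-partN") are skipped.
    """
    mapping = {}
    for link in sorted(Path(by_id_dir).iterdir()):
        if not link.is_symlink() or "-part" in link.name:
            continue
        mapping[link.name] = os.path.realpath(link)
    return mapping


if __name__ == "__main__" and os.path.isdir("/dev/disk/by-id"):
    for name, node in map_by_id().items():
        print(f"{node:12} {name}")
```

Run after each reboot or recabling, the right-hand names should stay the same while the /dev/sdX column is free to change, which is why the by-id names are the safer thing to put in scripts and fstab entries.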


 > There are in actual fact 3 sata controllers in this machine, the
 > motherboard's 6 ports, 6 more on an inexpensive sata controller that are
 > actually the 4 raid10 Samsung 870 1T drives, and 4 more on a more
 > expensive 16 port card which has a quartet of 2T gigastone SSD's on it,
 > but the drives are not found in the order of the controllers. That
 > raid10 was composed w/o the third controller.


So:

* /home is on a RAID 10 with 2 @ mirror of 2 @ 1 TB Samsung 870 SSD?
I think that's what you call a raid10
* 4 @ 2 TB Gigastone SSD for a new RAID 10?

just installed, not mounted or made into a raid yet. WIP?


What drives are connected to which ports?
4 Samsung 870 1T's are on the 1st added controller.
ATM, 5 2T gigastones are on the 2nd added controller, the 16 port one.
smartctl says all 5 of those are fine.


What is on the other 20 ports?
On the mobo? A big dvd writer and 2 other half T or 1T samsung drives from earlier 860 runs, not currently mounted. No spinning rust anyplace now. I don't appreciate being a lab rat for seagate to experiment on.
A current lsblk:
gene@coyote:~$ lsblk
NAME        MAJ:MIN RM   SIZE RO TYPE   MOUNTPOINTS
sda           8:0    0 931.5G  0 disk
├─sda1        8:1    0 838.2G  0 part   /
├─sda2        8:2    0  46.8G  0 part   [SWAP]
└─sda3        8:3    0  46.6G  0 part   /tmp

sdb           8:16   1     0B  0 disk   is probably my camera, currently plugged in

sdc           8:32   1     0B  0 disk   is probably my brother MFP-J6920DW printer, always plugged in
first controller, 6 port
sdd           8:48   0 931.5G  0 disk
├─sdd1        8:49   0   900G  0 part
│ └─md0       9:0    0   1.7T  0 raid10
│   └─md0p1 259:0    0   1.7T  0 part   /home
├─sdd2        8:50   0    30G  0 part
│ └─md1       9:1    0    60G  0 raid10 [SWAP]
└─sdd3        8:51   0   1.5G  0 part
  └─md2       9:2    0     3G  0 raid10
sde           8:64   0 931.5G  0 disk
├─sde1        8:65   0   900G  0 part
│ └─md0       9:0    0   1.7T  0 raid10
│   └─md0p1 259:0    0   1.7T  0 part   /home
├─sde2        8:66   0    30G  0 part
│ └─md1       9:1    0    60G  0 raid10 [SWAP]
└─sde3        8:67   0   1.5G  0 part
  └─md2       9:2    0     3G  0 raid10
sdf           8:80   0 931.5G  0 disk
├─sdf1        8:81   0   900G  0 part
│ └─md0       9:0    0   1.7T  0 raid10
│   └─md0p1 259:0    0   1.7T  0 part   /home
├─sdf2        8:82   0    30G  0 part
│ └─md1       9:1    0    60G  0 raid10 [SWAP]
└─sdf3        8:83   0   1.5G  0 part
  └─md2       9:2    0     3G  0 raid10
sdg           8:96   0 931.5G  0 disk
├─sdg1        8:97   0   900G  0 part
│ └─md0       9:0    0   1.7T  0 raid10
│   └─md0p1 259:0    0   1.7T  0 part   /home
├─sdg2        8:98   0    30G  0 part
│ └─md1       9:1    0    60G  0 raid10 [SWAP]
└─sdg3        8:99   0   1.5G  0 part
  └─md2       9:2    0     3G  0 raid10

2nd controller, 16 ports, all 5 2T gigastone's
sdh           8:112  0   1.9T  0 disk
└─sdh1        8:113  0   1.9T  0 part
sdi           8:128  0   1.9T  0 disk
└─sdi1        8:129  0   1.9T  0 part
sdj           8:144  0   1.9T  0 disk
└─sdj1        8:145  0   1.9T  0 part
sdk           8:160  0   1.9T  0 disk
└─sdk1        8:161  0   1.9T  0 part
sdl           8:176  0   1.9T  0 disk
└─sdl1        8:177  0   1.9T  0 part
sr0          11:0    1  1024M  0 rom  The internal dvd writer
gene@coyote:~$



 > blkid does not sort them in order either. And of course does not list
 > what's unmounted, forcing me to ident the drive by gparted in order to
 > get its device name. From that I might be able to construct another raid
 > from the 8T of 4 2T drives but its confusing as hell when the first of
 > those 2T drives is assigned /dev/sde and the next 4 on the new
 > controller are /dev/sdi, j, k, & l.
 > So it appears I have 5 of those gigastones, and sde is the odd one
Which, when it was /dev/sde1, was plugged into the 1st extra controller.
When the data cable was plugged into a motherboard port, it became /dev/sdb1. So I've relabeled it, and am about to test it on the second, 16 port controller.


I am confused -- do you have 4 or 5 Gigastone 2 TB SSD?

5,  ordered in 2 separate orders.

 > So that one could be formatted ext4 and serve as a backup of the raid10.
What I am trying to do now, but cannot if it is plugged into a motherboard port, hence the repeat of this exercise on the 2nd sata card.

 > how do I make an image of that
 > raid10  to /dev/sde and get every byte?  That seems like the first step
 > to me.
This I am still trying to do, the first pass copied all 350G of /home but went to the wrong drive, and I had mounted the drive by its label.
It is now /dev/sdh and all labels above it are now wrong. Crazy.
These SSD's all have an OTP serial number. I am tempted to use that serial number as a label _I_ can control. And according to gparted, labels do not survive being incorporated into a raid as the raid is all labeled with hostname : partition number. So there really is no way in linux to define a drive that is that drive forever. Unreal...

Please get a USB 3.x HDD, do a full backup of your entire computer, put it off-site, get another USB 3.x HDD, do another full backup, and keep it nearby.

That, using amanda, is the end target of this. But I have bought 3 such spinning rust drives over the years and have not had any survive being hot plugged into a usb port more than twice.

With that track record, I'll not waste any more money down that rabbit hole.


 >   But since I can't copy a locked file,


What file is locked?  Please post a console session that demonstrates.
A file that is opened but not closed is exclusive to that app and its lock, and cannot be copied except by rsync, or so I have been told. And there are quite a few such open locks on this system right now. This killed my full housed amiga when the boot drive with all its custom scripts died, and I found the backups I had were totally devoid of any of those scripts. I still have about 20 QIC tapes from that machine, but now no drives to read them. I need to cull the midden heap.


 > /dev/sde1 has been formatted and mounted, what cmd line will copy every
 > byte including locked files in that raid10 to it?


See above for locked.  Otherwise, I suggest rsync(1).
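
A minimal sketch of what such an rsync invocation might look like, wrapped in Python so the dry-run switch is explicit. The destination /mnt/backup/home is a hypothetical mount point, and the helper name is made up. Note that rsync makes a file-level copy, not a byte-for-byte image of the array; it will read files that other programs have open, but it cannot guarantee a consistent snapshot of a file that is being written during the copy.

```python
import shlex


def build_rsync_cmd(src: str, dest: str, dry_run: bool = True) -> list:
    """Build an rsync command that mirrors src into dest.

    -a  archive mode (recursion, permissions, times, owners, symlinks)
    -H  preserve hard links
    -A  preserve ACLs
    -X  preserve extended attributes
    """
    cmd = ["rsync", "-aHAX", "--numeric-ids"]
    if dry_run:
        cmd.append("--dry-run")  # report what would be copied, change nothing
    # Trailing slash on src: copy the directory's contents, not the directory itself.
    cmd += [src.rstrip("/") + "/", dest]
    return cmd


cmd = build_rsync_cmd("/home", "/mnt/backup/home")  # /mnt/backup is hypothetical
print(shlex.join(cmd))
# To run it for real (as root, after checking the dry-run output):
#   import subprocess
#   subprocess.run(build_rsync_cmd("/home", "/mnt/backup/home", dry_run=False), check=True)
```

Starting with the dry run makes it possible to confirm the destination really is the intended drive before anything is written.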

[...]
Thank you David.

Cheers, Gene Heskett.
--
"There are four boxes to be used in defense of liberty:
 soap, ballot, jury, and ammo. Please use in that order."
-Ed Howdershelt (Author, 1940)
If we desire respect for the law, we must first make the law respectable.
 - Louis D. Brandeis
