Re: [ceph-users] jbod + SMART : how to identify failing disks ?

2014-12-11 Thread SCHAER Frederic
Hi,

Back on this.
I finally found out a logic in the mapping.

So after taking the time to note all the disks' serial numbers on 3 different 
machines and 2 different OSes, I now know that my specific LSI SAS 2008 cards 
(no reference on them, but I think those are LSI SAS 9207-8i) map the disks of 
the MD1000 in reverse alphabetical order:

sd{b..p} map to slot{14..0}
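To exploit that, something like this (an untested sketch, and the jbod0-slotN 
names are just an example) could give stable-ish names, as long as the kernel 
keeps enumerating the disks in that order:

#!/bin/bash
# Sketch: turn the observed reverse ordering (sdb..sdp -> slot 14..0)
# into /dev/jbod0-slotN symlinks.
slot=14
for dev in /dev/sd{b..p}; do
    [ -b "$dev" ] && ln -sf "$dev" "/dev/jbod0-slot${slot}"
    slot=$((slot - 1))
done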

There is absolutely nothing else that appears usable, except the sas_address of 
the disks, which seems associated with the slots.
But even that differs from machine to machine, and the address-to-slot 
mapping does not seem obvious, to say the least...

Good thing is that I now know that fun tools exist in packages such as 
sg3_utils, smp_utils and others like mpt-status...
Next step is to try an MD1200 ;)

Thanks again
Cheers

-----Original Message-----
From: JF Le Fillâtre [mailto:jean-francois.lefilla...@uni.lu] 
Sent: Wednesday, November 19, 2014 13:42
To: SCHAER Frederic
Cc: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] jbod + SMART : how to identify failing disks ?


Hello again,

So whatever magic allows the Dell MD1200 to report the slot position for
each disk isn't present in your JBODs. Time for something else.

There are two sides to your problem:

1) Identifying which disk is where in your JBOD

Quite easy. Again, I'd go for a udev rule + script that will either
rename the disks entirely, or create a symlink with a name like
jbodX-slotY or something, to make it easy to figure out which is which. The
end_device-to-slot mapping can be static in the script, so you only need to
identify once the order in which the kernel scans the slots, and then you
can map them.
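A rough sketch of what I mean (untested; the jbodX-slotY names and the
mapping table are placeholders you'd fill in for your hardware):

# /etc/udev/rules.d/60-jbod-slots.rules (sketch)
KERNEL=="sd*[a-z]", KERNELS=="end_device-*", PROGRAM="/usr/local/bin/jbod-slot %k", SYMLINK+="jbod0-slot%c"

# /usr/local/bin/jbod-slot (sketch): static device-order-to-slot mapping
#!/bin/bash
declare -A SLOT=( [sdb]=14 [sdc]=13 [sdd]=12 )   # ...and so on for the rest
echo "${SLOT[$1]:-unknown}"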

But it won't survive a disk swap or a change of scanning order from a
kernel upgrade, so it's not enough.

2) Finding a way of identification independent of hot-plugs and scan order

That's the tricky part. If you remove a disk from your JBOD and replace
it with another one, the other one will get another sdX name, and in
my experience even another end_device-... name. But given that you
want the new disk to have the exact same name or symlink as the previous
one, you have to find something in the path of the device or (better) in
the udev attributes that is immutable.

If possible at all, it will depend on your specific hardware
combination, so you will have to try for yourself.

Suggested methodology:

1) write down the serial number of one drive in any slot, and figure out
its device name (sdX) with smartctl -i /dev/sd...

2) grab the detailed /sys path name and list of udev attributes:
readlink -f /sys/class/block/sdX
udevadm info --attribute-walk /dev/sdX

3) pull that disk and replace it. Check the logs to see which is its new
device name (sdY)

4) rerun the commands from #2 with sdY

5) compare the outputs and find something in the path or in the
attributes that didn't change and is unique to that disk (i.e. not a
common parent, for example).

If you have something that really didn't change, you're in luck. Either
use the serial numbers, or unplug and replug all disks one by one, to
figure out the mapping between slot number and immutable item.

Then write the udev rule. :)
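For instance, if the immutable thing turns out to be the SAS address (just an
assumption: use whatever attribute actually survived the swap on your
hardware), one rule per slot would be enough:

# sketch: the address below is a made-up placeholder
KERNEL=="sd*[a-z]", ATTRS{sas_address}=="0x5000c50012345678", SYMLINK+="jbod0-slot5"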

Thanks!
JF



On 19/11/14 11:29, SCHAER Frederic wrote:
 [...]

Re: [ceph-users] jbod + SMART : how to identify failing disks ?

2014-11-19 Thread SCHAER Frederic
Hi

Thanks.
I hoped it would be it, but no ;)

With this mapping :
lrwxrwxrwx 1 root root 0 Nov 12 12:31 /sys/class/block/sdb -> 
../../devices/pci:00/:00:04.0/:0a:00.0/host1/port-1:0/expander-1:0/port-1:0:0/expander-1:1/port-1:1:0/end_device-1:1:0/target1:0:1/1:0:1:0/block/sdb
lrwxrwxrwx 1 root root 0 Nov 12 12:31 /sys/class/block/sdc -> 
../../devices/pci:00/:00:04.0/:0a:00.0/host1/port-1:0/expander-1:0/port-1:0:0/expander-1:1/port-1:1:1/end_device-1:1:1/target1:0:2/1:0:2:0/block/sdc
lrwxrwxrwx 1 root root 0 Nov 12 12:31 /sys/class/block/sdd -> 
../../devices/pci:00/:00:04.0/:0a:00.0/host1/port-1:0/expander-1:0/port-1:0:0/expander-1:1/port-1:1:2/end_device-1:1:2/target1:0:3/1:0:3:0/block/sdd
lrwxrwxrwx 1 root root 0 Nov 12 12:31 /sys/class/block/sde -> 
../../devices/pci:00/:00:04.0/:0a:00.0/host1/port-1:0/expander-1:0/port-1:0:0/expander-1:1/port-1:1:3/end_device-1:1:3/target1:0:4/1:0:4:0/block/sde
lrwxrwxrwx 1 root root 0 Nov 12 12:31 /sys/class/block/sdf -> 
../../devices/pci:00/:00:04.0/:0a:00.0/host1/port-1:0/expander-1:0/port-1:0:0/expander-1:1/port-1:1:4/end_device-1:1:4/target1:0:5/1:0:5:0/block/sdf
lrwxrwxrwx 1 root root 0 Nov 12 12:31 /sys/class/block/sdg -> 
../../devices/pci:00/:00:04.0/:0a:00.0/host1/port-1:0/expander-1:0/port-1:0:0/expander-1:1/port-1:1:5/end_device-1:1:5/target1:0:6/1:0:6:0/block/sdg
lrwxrwxrwx 1 root root 0 Nov 12 12:31 /sys/class/block/sdh -> 
../../devices/pci:00/:00:04.0/:0a:00.0/host1/port-1:0/expander-1:0/port-1:0:0/expander-1:1/port-1:1:6/end_device-1:1:6/target1:0:7/1:0:7:0/block/sdh
lrwxrwxrwx 1 root root 0 Nov 12 12:31 /sys/class/block/sdi -> 
../../devices/pci:00/:00:04.0/:0a:00.0/host1/port-1:0/expander-1:0/port-1:0:0/expander-1:1/port-1:1:7/end_device-1:1:7/target1:0:8/1:0:8:0/block/sdi
lrwxrwxrwx 1 root root 0 Nov 12 12:31 /sys/class/block/sdj -> 
../../devices/pci:00/:00:04.0/:0a:00.0/host1/port-1:0/expander-1:0/port-1:0:1/expander-1:2/port-1:2:0/end_device-1:2:0/target1:0:9/1:0:9:0/block/sdj
lrwxrwxrwx 1 root root 0 Nov 12 12:31 /sys/class/block/sdk -> 
../../devices/pci:00/:00:04.0/:0a:00.0/host1/port-1:0/expander-1:0/port-1:0:1/expander-1:2/port-1:2:1/end_device-1:2:1/target1:0:10/1:0:10:0/block/sdk
lrwxrwxrwx 1 root root 0 Nov 12 12:31 /sys/class/block/sdl -> 
../../devices/pci:00/:00:04.0/:0a:00.0/host1/port-1:0/expander-1:0/port-1:0:1/expander-1:2/port-1:2:2/end_device-1:2:2/target1:0:11/1:0:11:0/block/sdl
lrwxrwxrwx 1 root root 0 Nov 12 12:31 /sys/class/block/sdm -> 
../../devices/pci:00/:00:04.0/:0a:00.0/host1/port-1:0/expander-1:0/port-1:0:1/expander-1:2/port-1:2:3/end_device-1:2:3/target1:0:12/1:0:12:0/block/sdm
lrwxrwxrwx 1 root root 0 Nov 12 12:31 /sys/class/block/sdn -> 
../../devices/pci:00/:00:04.0/:0a:00.0/host1/port-1:0/expander-1:0/port-1:0:1/expander-1:2/port-1:2:4/end_device-1:2:4/target1:0:13/1:0:13:0/block/sdn
lrwxrwxrwx 1 root root 0 Nov 12 12:31 /sys/class/block/sdo -> 
../../devices/pci:00/:00:04.0/:0a:00.0/host1/port-1:0/expander-1:0/port-1:0:1/expander-1:2/port-1:2:5/end_device-1:2:5/target1:0:14/1:0:14:0/block/sdo
lrwxrwxrwx 1 root root 0 Nov 12 12:31 /sys/class/block/sdp -> 
../../devices/pci:00/:00:04.0/:0a:00.0/host1/port-1:0/expander-1:0/port-1:0:1/expander-1:2/port-1:2:6/end_device-1:2:6/target1:0:15/1:0:15:0/block/sdp

sdd was on physical slot 12, sdk was on slot 5, and sdg was on slot 9 (and I 
did not check the others)...
so clearly this cannot be put in production as is and I'll have to find a way.

Regards


-----Original Message-----
From: Carl-Johan Schenström [mailto:carl-johan.schenst...@gu.se] 
Sent: Monday, November 17, 2014 14:14
To: SCHAER Frederic; Scottix; Erik Logtenberg
Cc: ceph-users@lists.ceph.com
Subject: RE: [ceph-users] jbod + SMART : how to identify failing disks ?

Hi!

I'm fairly sure that the link targets in /sys/class/block were correct the last 
time I had to change a drive on a system with a Dell HBA connected to an 
MD1000, but perhaps I was just lucky. =/

I.e.,

# ls -l /sys/class/block/sdj
lrwxrwxrwx. 1 root root 0 17 nov 13.54 /sys/class/block/sdj -> 
../../devices/pci:20/:20:0a.0/:21:00.0/host7/port-7:0/expander-7:0/port-7:0:1/expander-7:2/port-7:2:6/end_device-7:2:6/target7:0:7/7:0:7:0/block/sdj

would be first port on HBA, first expander, 7th slot (6, starting from 0). 
Don't take my word for it, though!
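A quick way to eyeball that for all drives at once (just a sketch, same
caveat as above):

# Print each disk together with the expander/port part of its sysfs path
for dev in /sys/class/block/sd[a-z]; do
    path=$(readlink -f "$dev")
    printf '%-4s %s\n' "${dev##*/}" "${path#*/host*/}"
done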

-- 
Carl-Johan Schenström
Driftansvarig / System Administrator
Språkbanken & Svensk nationell datatjänst /
The Swedish Language Bank & Swedish National Data Service
Göteborgs universitet / University of Gothenburg
carl-johan.schenst...@gu.se / +46 709 116769


From: ceph-users <ceph-users-boun...@lists.ceph.com> on behalf of SCHAER 
Frederic <frederic.sch...@cea.fr>
Sent: Friday, November 14, 2014 17:24
To: Scottix; Erik Logtenberg
Cc: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] jbod + SMART : how to identify failing disks ?

Re: [ceph-users] jbod + SMART : how to identify failing disks ?

2014-11-18 Thread SCHAER Frederic
Wow. Thanks
Not very operations friendly though…

Wouldn’t it be just OK to pull the disk that we think is the bad one, check the 
serial number, and if it's not the right one, plug it back in and let the udev 
rules do their job and re-insert the disk into the Ceph cluster ?
(provided XFS doesn’t freeze for good when we do that)

Regards

From: Craig Lewis [mailto:cle...@centraldesktop.com]
Sent: Monday, November 17, 2014 22:32
To: SCHAER Frederic
Cc: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] jbod + SMART : how to identify failing disks ?

I use `dd` to force activity to the disk I want to replace, and watch the 
activity lights.  That only works if your disks aren't 100% busy.  If they are, 
stop the ceph-osd daemon, and see which drive stops having activity.  Repeat 
until you're 100% confident that you're pulling the right drive.

On Wed, Nov 12, 2014 at 5:05 AM, SCHAER Frederic 
frederic.sch...@cea.fr wrote:
Hi,

I’m used to RAID software giving me the failing disks & slots, and most often 
blinking the disks on the disk bays.
I recently installed a  DELL “6GB HBA SAS” JBOD card, said to be an LSI 2008 
one, and I now have to identify 3 pre-failed disks (so says S.M.A.R.T) .

Since this is an LSI, I thought I’d use MegaCli to identify the disks slot, but 
MegaCli does not see the HBA card.
Then I found the LSI “sas2ircu” utility, but again, this one fails at giving me 
the disk slots (it finds the disks, serials and others, but slot is always 0)
Because of this, I’m going to head over to the disk bay and unplug the disk 
which I think corresponds to the alphabetical order in linux, and see if it’s 
the correct one…. But even if this is correct this time, it might not be next 
time.

But this makes me wonder : how do you guys, Ceph users, manage your disks if 
you really have JBOD servers ?
I can’t imagine having to guess slots that each time, and I can’t imagine 
neither creating serial number stickers for every single disk I could have to 
manage …
Is there any specific advice regarding JBOD cards people should (not) use in 
their systems ?
Any magical way to “blink” a drive in linux ?

Thanks & regards

___
ceph-users mailing list
ceph-users@lists.ceph.commailto:ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] jbod + SMART : how to identify failing disks ?

2014-11-17 Thread Craig Lewis
I use `dd` to force activity to the disk I want to replace, and watch the
activity lights.  That only works if your disks aren't 100% busy.  If they
are, stop the ceph-osd daemon, and see which drive stops having activity.
Repeat until you're 100% confident that you're pulling the right drive.
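Something along these lines (a sketch; /dev/sdX is whichever disk you
suspect):

# Read the suspect disk and watch for the activity LED that stays lit.
# Reads only, so it's harmless.
dd if=/dev/sdX of=/dev/null bs=1M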

On Wed, Nov 12, 2014 at 5:05 AM, SCHAER Frederic frederic.sch...@cea.fr
wrote:

  Hi,



 I’m used to RAID software giving me the failing disks & slots, and most
 often blinking the disks on the disk bays.

 I recently installed a  DELL “6GB HBA SAS” JBOD card, said to be an LSI
 2008 one, and I now have to identify 3 pre-failed disks (so says S.M.A.R.T)
 .



 Since this is an LSI, I thought I’d use MegaCli to identify the disks
 slot, but MegaCli does not see the HBA card.

 Then I found the LSI “sas2ircu” utility, but again, this one fails at
 giving me the disk slots (it finds the disks, serials and others, but slot
 is always 0)

 Because of this, I’m going to head over to the disk bay and unplug the
 disk which I think corresponds to the alphabetical order in linux, and see
 if it’s the correct one…. But even if this is correct this time, it might
 not be next time.



 But this makes me wonder : how do you guys, Ceph users, manage your disks
 if you really have JBOD servers ?

 I can’t imagine having to guess slots that each time, and I can’t imagine
 neither creating serial number stickers for every single disk I could have
 to manage …

 Is there any specific advice regarding JBOD cards people should (not) use
 in their systems ?

 Any magical way to “blink” a drive in linux ?



 Thanks & regards

 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] jbod + SMART : how to identify failing disks ?

2014-11-17 Thread Cedric Lemarchand
Hi,

Try looking for a file named "locate" in a folder named "Slot X", where X is
the number of the slot; echoing 1 into the "locate" file will make the
LED blink:

# find /sys -name "locate" | grep Slot
/sys/devices/pci:00/:00:03.0/:06:00.0/host6/port-6:0/expander-6:0/port-6:0:28/end_device-6:0:28/target6:0:28/6:0:28:0/enclosure/6:0:28:0/Slot
01/locate
/sys/devices/pci:00/:00:03.0/:06:00.0/host6/port-6:0/expander-6:0/port-6:0:28/end_device-6:0:28/target6:0:28/6:0:28:0/enclosure/6:0:28:0/Slot
02/locate
/sys/devices/pci:00/:00:03.0/:06:00.0/host6/port-6:0/expander-6:0/port-6:0:28/end_device-6:0:28/target6:0:28/6:0:28:0/enclosure/6:0:28:0/Slot
03/locate
/sys/devices/pci:00/:00:03.0/:06:00.0/host6/port-6:0/expander-6:0/port-6:0:28/end_device-6:0:28/target6:0:28/6:0:28:0/enclosure/6:0:28:0/Slot
04/locate
/sys/devices/pci:00/:00:03.0/:06:00.0/host6/port-6:0/expander-6:0/port-6:0:28/end_device-6:0:28/target6:0:28/6:0:28:0/enclosure/6:0:28:0/Slot
05/locate
/sys/devices/pci:00/:00:03.0/:06:00.0/host6/port-6:0/expander-6:0/port-6:0:28/end_device-6:0:28/target6:0:28/6:0:28:0/enclosure/6:0:28:0/Slot
06/locate
/sys/devices/pci:00/:00:03.0/:06:00.0/host6/port-6:0/expander-6:0/port-6:0:28/end_device-6:0:28/target6:0:28/6:0:28:0/enclosure/6:0:28:0/Slot
07/locate
/sys/devices/pci:00/:00:03.0/:06:00.0/host6/port-6:0/expander-6:0/port-6:0:28/end_device-6:0:28/target6:0:28/6:0:28:0/enclosure/6:0:28:0/Slot
08/locate
/sys/devices/pci:00/:00:03.0/:06:00.0/host6/port-6:0/expander-6:0/port-6:0:28/end_device-6:0:28/target6:0:28/6:0:28:0/enclosure/6:0:28:0/Slot
09/locate
/sys/devices/pci:00/:00:03.0/:06:00.0/host6/port-6:0/expander-6:0/port-6:0:28/end_device-6:0:28/target6:0:28/6:0:28:0/enclosure/6:0:28:0/Slot
10/locate
/sys/devices/pci:00/:00:03.0/:06:00.0/host6/port-6:0/expander-6:0/port-6:0:28/end_device-6:0:28/target6:0:28/6:0:28:0/enclosure/6:0:28:0/Slot
11/locate
/sys/devices/pci:00/:00:03.0/:06:00.0/host6/port-6:0/expander-6:0/port-6:0:28/end_device-6:0:28/target6:0:28/6:0:28:0/enclosure/6:0:28:0/Slot
12/locate
/sys/devices/pci:00/:00:03.0/:06:00.0/host6/port-6:0/expander-6:0/port-6:0:28/end_device-6:0:28/target6:0:28/6:0:28:0/enclosure/6:0:28:0/Slot
13/locate
/sys/devices/pci:00/:00:03.0/:06:00.0/host6/port-6:0/expander-6:0/port-6:0:28/end_device-6:0:28/target6:0:28/6:0:28:0/enclosure/6:0:28:0/Slot
14/locate
/sys/devices/pci:00/:00:03.0/:06:00.0/host6/port-6:0/expander-6:0/port-6:0:28/end_device-6:0:28/target6:0:28/6:0:28:0/enclosure/6:0:28:0/Slot
15/locate
/sys/devices/pci:00/:00:03.0/:06:00.0/host6/port-6:0/expander-6:0/port-6:0:28/end_device-6:0:28/target6:0:28/6:0:28:0/enclosure/6:0:28:0/Slot
16/locate
/sys/devices/pci:00/:00:03.0/:06:00.0/host6/port-6:0/expander-6:0/port-6:0:28/end_device-6:0:28/target6:0:28/6:0:28:0/enclosure/6:0:28:0/Slot
17/locate
/sys/devices/pci:00/:00:03.0/:06:00.0/host6/port-6:0/expander-6:0/port-6:0:28/end_device-6:0:28/target6:0:28/6:0:28:0/enclosure/6:0:28:0/Slot
18/locate
/sys/devices/pci:00/:00:03.0/:06:00.0/host6/port-6:0/expander-6:0/port-6:0:28/end_device-6:0:28/target6:0:28/6:0:28:0/enclosure/6:0:28:0/Slot
19/locate
/sys/devices/pci:00/:00:03.0/:06:00.0/host6/port-6:0/expander-6:0/port-6:0:28/end_device-6:0:28/target6:0:28/6:0:28:0/enclosure/6:0:28:0/Slot
20/locate
/sys/devices/pci:00/:00:03.0/:06:00.0/host6/port-6:0/expander-6:0/port-6:0:28/end_device-6:0:28/target6:0:28/6:0:28:0/enclosure/6:0:28:0/Slot
21/locate
/sys/devices/pci:00/:00:03.0/:06:00.0/host6/port-6:0/expander-6:0/port-6:0:28/end_device-6:0:28/target6:0:28/6:0:28:0/enclosure/6:0:28:0/Slot
22/locate
/sys/devices/pci:00/:00:03.0/:06:00.0/host6/port-6:0/expander-6:0/port-6:0:28/end_device-6:0:28/target6:0:28/6:0:28:0/enclosure/6:0:28:0/Slot
23/locate
/sys/devices/pci:00/:00:03.0/:06:00.0/host6/port-6:0/expander-6:0/port-6:0:28/end_device-6:0:28/target6:0:28/6:0:28:0/enclosure/6:0:28:0/Slot
24/locate
/sys/devices/pci:00/:00:03.0/:06:00.0/host6/port-6:0/expander-6:0/port-6:0:28/end_device-6:0:28/target6:0:28/6:0:28:0/enclosure/6:0:28:0/Slot
25/locate
/sys/devices/pci:00/:00:03.0/:06:00.0/host6/port-6:0/expander-6:0/port-6:0:28/end_device-6:0:28/target6:0:28/6:0:28:0/enclosure/6:0:28:0/Slot
26/locate
/sys/devices/pci:00/:00:03.0/:06:00.0/host6/port-6:0/expander-6:0/port-6:0:28/end_device-6:0:28/target6:0:28/6:0:28:0/enclosure/6:0:28:0/Slot
27/locate
/sys/devices/pci:00/:00:03.0/:06:00.0/host6/port-6:0/expander-6:0/port-6:0:28/end_device-6:0:28/target6:0:28/6:0:28:0/enclosure/6:0:28:0/Slot
28/locate

LSI 9200-8e with a Supermicro JBOD 28 slots, Ubuntu 12.04, 3.13 kernel.
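For example (a sketch; the same slots should also be reachable through the
/sys/class/enclosure shortcut):

# Start blinking the locate LED of slot 7, then turn it off again
echo 1 > "/sys/class/enclosure/6:0:28:0/Slot 07/locate"
echo 0 > "/sys/class/enclosure/6:0:28:0/Slot 07/locate"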

Cheers


On 12/11/2014 14:05, SCHAER Frederic wrote:

 Hi,

  

 I’m used to RAID software giving me the failing disks & slots, and most
 

Re: [ceph-users] jbod + SMART : how to identify failing disks ?

2014-11-17 Thread Cedric Lemarchand
Sorry, I forgot to say that in "Slot X"/device/block you can find the
device name, like sdc.
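A quick sketch to print the whole slot-to-device mapping at once (assuming
the enclosure shows up under /sys/class/enclosure as above):

# Print which sdX sits in each slot
for slot in /sys/class/enclosure/*/Slot*; do
    printf '%s -> %s\n' "${slot##*/}" "$(ls "$slot/device/block" 2>/dev/null)"
done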

Cheers



On 18/11/2014 00:15, Cedric Lemarchand wrote:
 Hi,

 Try looking for a file named "locate" in a folder named "Slot X", where X
 is the number of the slot; echoing 1 into the "locate" file will make
 the LED blink:

 # find /sys -name "locate" | grep Slot
 [...]


Re: [ceph-users] jbod + SMART : how to identify failing disks ?

2014-11-17 Thread Christian Balzer

On Mon, 17 Nov 2014 13:31:57 -0800 Craig Lewis wrote:

 I use `dd` to force activity to the disk I want to replace, and watch the
 activity lights.  That only works if your disks aren't 100% busy.  If
 they are, stop the ceph-osd daemon, and see which drive stops having
 activity. Repeat until you're 100% confident that you're pulling the
 right drive.

I use smartctl for lighting up the disk, but same diff. 
JBOD can become a big PITA quickly with large deployments and if you don't
have people with sufficient skill doing disk replacements.
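The smartctl variant is basically just (sketch):

# Keep the suspect drive's activity LED flashing
while true; do smartctl -a /dev/sdX > /dev/null; sleep 0.2; done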

Also depending on how a disk died you might not be able to reclaim the
drive ID (sdc for example) without a reboot, making things even more
confusing. 

Some RAID cards in IT/JBOD mode _will_ actually light up the fail LED if
a disk fails and/or have tools to blink a specific disk. 
However with the latter, the task of matching a disk from the controller's
perspective to what Linux enumerated it as is still on you.

Ceph might scale up to really large deployments, but you'd better have a
well-staffed data center to go with that, or deploy it in a non-JBOD
fashion. 

Christian

 On Wed, Nov 12, 2014 at 5:05 AM, SCHAER Frederic frederic.sch...@cea.fr
 wrote:
 
   Hi,
 
 
 
  I’m used to RAID software giving me the failing disks & slots, and most
  often blinking the disks on the disk bays.
 
  I recently installed a  DELL “6GB HBA SAS” JBOD card, said to be an LSI
  2008 one, and I now have to identify 3 pre-failed disks (so says
  S.M.A.R.T) .
 
 
 
  Since this is an LSI, I thought I’d use MegaCli to identify the disks
  slot, but MegaCli does not see the HBA card.
 
  Then I found the LSI “sas2ircu” utility, but again, this one fails at
  giving me the disk slots (it finds the disks, serials and others, but
  slot is always 0)
 
  Because of this, I’m going to head over to the disk bay and unplug the
  disk which I think corresponds to the alphabetical order in linux, and
  see if it’s the correct one…. But even if this is correct this time,
  it might not be next time.
 
 
 
  But this makes me wonder : how do you guys, Ceph users, manage your
  disks if you really have JBOD servers ?
 
  I can’t imagine having to guess slots that each time, and I can’t
  imagine neither creating serial number stickers for every single disk
  I could have to manage …
 
  Is there any specific advice regarding JBOD cards people should (not)
  use in their systems ?
 
  Any magical way to “blink” a drive in linux ?
 
 
 
  Thanks & regards
 
  ___
  ceph-users mailing list
  ceph-users@lists.ceph.com
  http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
 
 


-- 
Christian Balzer        Network/Systems Engineer
ch...@gol.com   Global OnLine Japan/Fusion Communications
http://www.gol.com/
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] jbod + SMART : how to identify failing disks ?

2014-11-12 Thread SCHAER Frederic
Hi,

I'm used to RAID software giving me the failing disks & slots, and most often 
blinking the disks on the disk bays.
I recently installed a DELL "6GB HBA SAS" JBOD card, said to be an LSI 2008 
one, and I now have to identify 3 pre-failed disks (so says S.M.A.R.T.).

Since this is an LSI, I thought I'd use MegaCli to identify the disk slots, but 
MegaCli does not see the HBA card.
Then I found the LSI "sas2ircu" utility, but again, this one fails at giving me 
the disk slots (it finds the disks, serials and so on, but the slot is always 0).
Because of this, I'm going to head over to the disk bay and unplug the disk 
which I think corresponds to the alphabetical order in Linux, and see if it's 
the correct one... But even if this is correct this time, it might not be next 
time.

But this makes me wonder : how do you guys, Ceph users, manage your disks if 
you really have JBOD servers ?
I can't imagine having to guess slots like that each time, and I can't imagine 
creating serial number stickers for every single disk I could have to 
manage either...
Is there any specific advice regarding JBOD cards people should (not) use in 
their systems ?
Any magical way to "blink" a drive in Linux ?

Thanks & regards
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] jbod + SMART : how to identify failing disks ?

2014-11-12 Thread Erik Logtenberg
I have no experience with the DELL SAS controller, but usually the
advantage of using a simple controller (instead of a RAID card) is that
you can use full SMART directly.

$ sudo smartctl -a /dev/sda

=== START OF INFORMATION SECTION ===
Device Model: INTEL SSDSA2BW300G3H
Serial Number:    PEPR2381003E300EGN

Personally, I make sure that I know which serial number drive is in
which bay, so I can easily tell which drive I'm talking about.

So you can use SMART both to notice (pre)failing disks -and- to
physically identify them.

The same smartctl command also returns the health status like so:

233 Media_Wearout_Indicator 0x0032   099   099   000    Old_age   Always       -       0

This specific SSD has 99% media lifetime left, so it's in the green. But
it will continue to gradually degrade, and at some point it'll hit the
percentage at which I like to replace it. To keep an eye on the speed of
decay, I'm graphing those SMART values in Cacti. That way I can somewhat
predict how long a disk will last, especially SSDs, which die very
gradually.
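The Cacti data source boils down to something like this (a sketch; column 4
is the normalised value):

# Grab the normalised Media_Wearout_Indicator value for graphing
smartctl -A /dev/sda | awk '/Media_Wearout_Indicator/ {print $4}'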

Erik.


On 12-11-14 14:43, JF Le Fillâtre wrote:
 
 Hi,
 
 May or may not work depending on your JBOD and the way it's identified
 and set up by the LSI card and the kernel:
 
 cat /sys/block/sdX/../../../../sas_device/end_device-*/bay_identifier
 
 The weird path and the wildcards are due to the way the sysfs is set up.
 
 That works with a Dell R520, 6GB HBA SAS cards and Dell MD1200s, running
 CentOS release 6.5.
 
 Note that you can make your life easier by writing an udev script that
 will create a symlink with a sane identifier for each of your external
 disks. If you match along the lines of
 
 KERNEL=="sd*[a-z]", KERNELS=="end_device-*:*:*"
 
 then you'll just have to cat /sys/class/sas_device/${1}/bay_identifier
 in a script (with $1 being the $id of udev after that match, so the
 string end_device-X:Y:Z) to obtain the bay ID.
 
 Thanks,
 JF
 
 
 
 On 12/11/14 14:05, SCHAER Frederic wrote:
 Hi,

  

 I’m used to RAID software giving me the failing disks & slots, and most
 often blinking the disks on the disk bays.

 I recently installed a  DELL “6GB HBA SAS” JBOD card, said to be an LSI
 2008 one, and I now have to identify 3 pre-failed disks (so says
 S.M.A.R.T) .

  

 Since this is an LSI, I thought I’d use MegaCli to identify the disks
 slot, but MegaCli does not see the HBA card.

 Then I found the LSI “sas2ircu” utility, but again, this one fails at
 giving me the disk slots (it finds the disks, serials and others, but
 slot is always 0)

 Because of this, I’m going to head over to the disk bay and unplug the
 disk which I think corresponds to the alphabetical order in linux, and
 see if it’s the correct one…. But even if this is correct this time, it
 might not be next time.

  

 But this makes me wonder : how do you guys, Ceph users, manage your
 disks if you really have JBOD servers ?

 I can’t imagine having to guess slots that each time, and I can’t
 imagine neither creating serial number stickers for every single disk I
 could have to manage …

 Is there any specific advice regarding JBOD cards people should (not)
 use in their systems ?

 Any magical way to “blink” a drive in linux ?

  

 Thanks & regards



 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
 

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] jbod + SMART : how to identify failing disks ?

2014-11-12 Thread Scottix
I would say it depends on your system and where the drives are connected.
Some HBAs have a CLI tool to manage the connected drives, like a
RAID card would.
One other method I found is that sometimes the LEDs are exposed for you:
http://fabiobaltieri.com/2011/09/21/linux-led-subsystem/ has an
article on /sys/class/leds, but there's no guarantee.

On my laptop I could turn on lights and stuff, but our server didn't
have anything. Seems like a feature either Linux or smartctl should
have. I have run into this problem before, but used a couple of tricks to
figure it out.

I guess the best solution is just to track the drives' serial numbers. Maybe
a good note to add to the docs, so people building a Ceph cluster are aware of it.
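Something as simple as this (sketch) gives you a device-to-serial list to
keep next to the physical labels:

# Print device name and serial number for every sd* disk
for d in /dev/sd[a-z]; do
    printf '%s  %s\n' "$d" "$(smartctl -i "$d" | awk -F': *' '/Serial Number/ {print $2}')"
done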

On Wed, Nov 12, 2014 at 9:06 AM, Erik Logtenberg e...@logtenberg.eu wrote:
 I have no experience with the DELL SAS controller, but usually the
 advantage of using a simple controller (instead of a RAID card) is that
 you can use full SMART directly.

 $ sudo smartctl -a /dev/sda

 === START OF INFORMATION SECTION ===
 Device Model: INTEL SSDSA2BW300G3H
 Serial Number:    PEPR2381003E300EGN

 Personally, I make sure that I know which serial number drive is in
 which bay, so I can easily tell which drive I'm talking about.

 So you can use SMART both to notice (pre)failing disks -and- to
 physically identify them.

 The same smartctl command also returns the health status like so:

 233 Media_Wearout_Indicator 0x0032   099   099   000    Old_age   Always       -       0

 This specific SSD has 99% media lifetime left, so it's in the green. But
 it will continue to gradually degrade, and at some point it'll hit the
 percentage at which I like to replace it. To keep an eye on the speed of
 decay, I'm graphing those SMART values in Cacti. That way I can somewhat
 predict how long a disk will last, especially SSDs, which die very
 gradually.

 Erik.


 On 12-11-14 14:43, JF Le Fillâtre wrote:

 Hi,

 May or may not work depending on your JBOD and the way it's identified
 and set up by the LSI card and the kernel:

 cat /sys/block/sdX/../../../../sas_device/end_device-*/bay_identifier

 The weird path and the wildcards are due to the way the sysfs is set up.

 That works with a Dell R520, 6GB HBA SAS cards and Dell MD1200s, running
 CentOS release 6.5.

 Note that you can make your life easier by writing an udev script that
 will create a symlink with a sane identifier for each of your external
 disks. If you match along the lines of

 KERNEL=="sd*[a-z]", KERNELS=="end_device-*:*:*"

 then you'll just have to cat /sys/class/sas_device/${1}/bay_identifier
 in a script (with $1 being the $id of udev after that match, so the
 string end_device-X:Y:Z) to obtain the bay ID.

 Thanks,
 JF



 On 12/11/14 14:05, SCHAER Frederic wrote:
 Hi,



 I’m used to RAID software giving me the failing disks & slots, and most
 often blinking the disks on the disk bays.

 I recently installed a  DELL “6GB HBA SAS” JBOD card, said to be an LSI
 2008 one, and I now have to identify 3 pre-failed disks (so says
 S.M.A.R.T) .



 Since this is an LSI, I thought I’d use MegaCli to identify the disks
 slot, but MegaCli does not see the HBA card.

 Then I found the LSI “sas2ircu” utility, but again, this one fails at
 giving me the disk slots (it finds the disks, serials and others, but
 slot is always 0)

 Because of this, I’m going to head over to the disk bay and unplug the
 disk which I think corresponds to the alphabetical order in linux, and
 see if it’s the correct one…. But even if this is correct this time, it
 might not be next time.



 But this makes me wonder : how do you guys, Ceph users, manage your
 disks if you really have JBOD servers ?

 I can’t imagine having to guess slots that each time, and I can’t
 imagine neither creating serial number stickers for every single disk I
 could have to manage …

 Is there any specific advice regarding JBOD cards people should (not)
 use in their systems ?

 Any magical way to “blink” a drive in linux ?



 Thanks & regards



 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



-- 
Follow Me: @Taijutsun
scot...@gmail.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com