Re: [OpenIndiana-discuss] Replacing an unavailable hd...
On 16/06/2014 3:18 PM, John Doe via openindiana-discuss wrote:
> I went onsite to replace the disk.
>
>   zpool offline VOLUME c5t2d0
>   + unplug old, plug new
>
> And nothing happens... There was this in messages:
>
> Jun 14 02:30:12 data-4 ahci: [ID 517647 kern.warning] WARNING: ahci0: watchdog port 2 satapkt 0xff06cd8cbc68 timed out
> Jun 14 02:30:23 data-4 ahci: [ID 860969 kern.warning] WARNING: ahci0: ahci_port_reset port 2 the device hardware has been initialized and the power-up diagnostics failed
> Jun 14 02:30:24 data-4 sata: [ID 801845 kern.info] /pci@0,0/pci15d9,62f@1f,2:
> Jun 14 02:30:24 data-4 SATA port 2 error
> Jun 14 02:30:24 data-4 scsi: [ID 107833 kern.warning] WARNING: /pci@0,0/pci15d9,62f@1f,2/disk@2,0 (sd3):
> Jun 14 02:30:24 data-4 SYNCHRONIZE CACHE command failed (5)
> Jun 14 02:30:25 data-4 sata: [ID 801845 kern.info] /pci@0,0/pci15d9,62f@1f,2:
> Jun 14 02:30:25 data-4 SATA port 2 error
> ...
>
> I tried:
>   cfgadm -cconnect xxx
> and get:
>   cfgadm: Insufficient condition
>
> Nothing new in /var/adm/messages. cfgadm still says:
>   sata0/2 sata-port disconnected unconfigured failed
> I guess "sata-port"
>
> So I guess this SATA port is bad (or just disabled?)...
> And of course, that's one of these servers that must stay on 24h/24...
> Is there a way to make the kernel try to reinitialize this SATA port apart from a reboot?

Try `fmadm faulty`
Then `fmadm repaired `
Then cfgadm commands

John

___
openindiana-discuss mailing list
openindiana-discuss@openindiana.org
http://openindiana.org/mailman/listinfo/openindiana-discuss
Re: [OpenIndiana-discuss] Replacing an unavailable hd...
Thermal effects will cause connector issues.

The most extreme case in my experience was a move of our group in Dallas during July. The systems were moved over the weekend. None would boot on Monday until I had reseated all the disk drive cables. They did not like sitting in a hot truck.

I've seen lots of systems which had cable issues after a shutdown which allowed the hardware to cool down, but not as consistent (i.e. everyone in my group) as the system move case.

Reg
Re: [OpenIndiana-discuss] Replacing an unavailable hd...
I went onsite to replace the disk.

  zpool offline VOLUME c5t2d0
  + unplug old, plug new

And nothing happens... There was this in messages:

Jun 14 02:30:12 data-4 ahci: [ID 517647 kern.warning] WARNING: ahci0: watchdog port 2 satapkt 0xff06cd8cbc68 timed out
Jun 14 02:30:23 data-4 ahci: [ID 860969 kern.warning] WARNING: ahci0: ahci_port_reset port 2 the device hardware has been initialized and the power-up diagnostics failed
Jun 14 02:30:24 data-4 sata: [ID 801845 kern.info] /pci@0,0/pci15d9,62f@1f,2:
Jun 14 02:30:24 data-4 SATA port 2 error
Jun 14 02:30:24 data-4 scsi: [ID 107833 kern.warning] WARNING: /pci@0,0/pci15d9,62f@1f,2/disk@2,0 (sd3):
Jun 14 02:30:24 data-4 SYNCHRONIZE CACHE command failed (5)
Jun 14 02:30:25 data-4 sata: [ID 801845 kern.info] /pci@0,0/pci15d9,62f@1f,2:
Jun 14 02:30:25 data-4 SATA port 2 error
...

I tried:
  cfgadm -cconnect xxx
and get:
  cfgadm: Insufficient condition

Nothing new in /var/adm/messages. cfgadm still says:
  sata0/2 sata-port disconnected unconfigured failed
I guess "sata-port"

So I guess this SATA port is bad (or just disabled?)...
And of course, that's one of these servers that must stay on 24h/24...
Is there a way to make the kernel try to reinitialize this SATA port apart from a reboot?
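Editor's sketch, not part of the original message: the cfgadm status line quoted above can be checked mechanically. The snippet below assumes the five-column layout (Ap_Id, Type, Receptacle, Occupant, Condition) that the quoted line appears to follow; `failed_ports` is a hypothetical helper name.

```shell
#!/bin/sh
# failed_ports: read `cfgadm -al`-style lines on stdin and print the Ap_Id
# of every attachment point whose Condition column is "failed".
# Assumption: Condition is the fifth whitespace-separated field.
failed_ports() {
    awk '$5 == "failed" { print $1 }'
}

# Sample input: a header row plus the status line quoted in the message.
printf '%s\n' \
    'Ap_Id Type Receptacle Occupant Condition' \
    'sata0/2 sata-port disconnected unconfigured failed' |
    failed_ports
```

On a live system the same filter could presumably be fed from `cfgadm -al | failed_ports`; the verbose `-alv` output uses a different column layout and would need a different field number.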
Re: [OpenIndiana-discuss] SMF service that should not shutdown with the OS
On Sun, Jun 15, 2014 at 10:12:14PM +0200, Jim Klimov wrote:
> On 2014-06-15 15:01, Gary Mills wrote:
> >
> >What you need here is a remote console.
>
> That's for sure, in general.
>
> But on this box it does not have telnet (or rather, it has either
> telnet or ssh - but not both at the same time), so I can not "telnet"
> from our Cisco router to the firewall/vpn/infrastructure server's
> RSC, and the Cisco SSH command (as of that firmware at least) does
> not support connections into a non-default VRF (routing table).
> I can however "telnet" anywhere, and I've put up a private VLAN where
> this server can listen for "insecure" telnet accesses from this Cisco.
> So it is all complicated :)

Where I used to work, we had a physically separate console network. All of the RSC and ILOM ports connected to this network. We had one server that connected to both the console network and our computer room backbone, but did not route between the two. Once you logged into that server, you could telnet or SSH to any of the console cards. We ran conserver there to automate that part.

--
-Gary Mills--refurb--Winnipeg, Manitoba, Canada-
Re: [OpenIndiana-discuss] Replacing an unavailable hd...
On 06/16/2014 05:05 AM, John Doe via openindiana-discuss wrote:
> From: Jim Klimov
> > If you are uncertain if the HDD device has really failed, you can also
> > try to take apart the computer and remove-replug the power and signal
> > cables, perhaps a few times. This may cleanse them of oxidation and
> > repair the storage - happened to me dozens of times on both home-made
> > rigs and brand servers (though rarely on the latter).
> >
> > Also, while you're near the box taken apart, you can listen to the
> > disk if it "squeaks" and vibrates when powered on, or no longer works
> > mechanically indeed.
> >
> > In fact, recently we've had a power outage that rebooted a couple of
> > old servers who had a dead and unresponsive HDD each (the poor boxes
> > waited for replacements to be purchased and received), and now the
> > disks are back online - several scrubs found no problems (that is,
> > after the initial resilver/scrub which complained a lot due to lots
> > of stale data). So there was even no mechanical replugging, just a
> > power cycle.
>
> Hum... the server is a bit less than 2 years old, and all disks are plugged on the same back-plane, so I would be surprised if it was a cabling issue. And, since I have spares, I prefer to replace the suspect disk to be safe and test it later... If it is indeed a cabling issue, the new disk will also look failed I guess.
>
> If it helps, cfgadm -alv says:
>   sata0/2 disconnected unconfigured failed
>
> I did not yet go onsite to witness if there is any red LED.

After moving the server to temporary quarters, I had a similar situation. I was able to bring the disk back up remotely (several hours' drive away) with the following commands:

  cfgadm -cunconnect
  cfgadm -cconnect
  cfgadm -cconfigure

This situation kept happening (failing between a day and a week later), even when I finally moved it to new quarters. I then re-seated the controller card and cables. It's been running for several months without a complaint.

I did see that sometimes when the disk failed it would even show up as not connected.

I tried rebooting for a while, which sometimes worked, but I was much more successful with the above commands, and they didn't require bringing the machine offline.

Even without the move, there is enough vibration to cause problems with marginal connections, so I concur with Jim that it could occur.

Gary
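Editor's sketch, not part of the original message: the three-step port cycle described above, wrapped so it can be previewed before anything is executed. Two assumptions are made here. First, `cfgadm -c` accepts `disconnect`, `connect` and `configure` as state changes, so the message's "unconnect" is read as `disconnect`. Second, the attachment point was omitted in the message, so `sata0/2` is borrowed from the original poster's report. `run`, `reconnect_port` and `DRY_RUN` are hypothetical names.

```shell
#!/bin/sh
# DRY_RUN=1 (the default here) only prints the commands; DRY_RUN=0 executes them.
DRY_RUN=${DRY_RUN:-1}

# run: hypothetical helper that either prints or executes a command.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# reconnect_port: cycle one attachment point, stopping at the first failure.
reconnect_port() {
    ap_id=$1
    run cfgadm -c disconnect "$ap_id" || return 1
    run cfgadm -c connect "$ap_id" || return 1
    run cfgadm -c configure "$ap_id" || return 1
}

# Preview the sequence for the port reported as failed in this thread.
reconnect_port sata0/2
```

Stopping at the first failing step keeps a later `configure` from running against a port that never reconnected, which matches the "Insufficient condition" error reported earlier in the thread.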
Re: [OpenIndiana-discuss] Replacing an unavailable hd...
On 2014-06-16 11:05, John Doe via openindiana-discuss wrote:
> Hum... the server is a bit less than 2 years old, and all disks are plugged on the same back-plane, so I would be surprised if it was a cabling issue. And, since I have spares, I prefer to replace the suspect disk to be safe and test it later... If it is indeed a cabling issue, the new disk will also look failed I guess.

Not necessarily. In case of oxidation, the problem is intermittent and will probably be fixed by the act of re-plugging, which scratches off the oxide films on contacts and brings the metal parts back into better contact. If this is the case, however, then it is likely that the situation will recur roughly each year, and can be "fixed" in the same way.

Thanks to ZFS, at least, if there are any botched data packets between the HDD and HBA due to weak signal/interference/noise when the film is growing but does not yet preclude communications completely, then you'd see checksum errors on the device (if there are mismatches indeed) and the pool will try to amend that automatically.

It may also be possible that the SAS/SCSI protocol includes some error correction and automatic retries for such in-flight mistakes, though there is likely nothing like that for the SATA/IDE protocols.

HTH,
Jim Klimov
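Editor's sketch, not part of the original message: one way to spot the per-device error counters mentioned above is to filter the config rows of `zpool status`. The helper name `devices_with_errors` is hypothetical, and the list of device states in the pattern is an assumption that may not cover every state a pool can report.

```shell
#!/bin/sh
# devices_with_errors: read `zpool status` output on stdin and print
# "name read write cksum" for each config row with a non-zero counter.
# Assumption: config rows look like "NAME STATE READ WRITE CKSUM [note]".
devices_with_errors() {
    awk '$2 ~ /^(ONLINE|DEGRADED|FAULTED|OFFLINE|UNAVAIL|REMOVED)$/ &&
         ($3 + $4 + $5) > 0 { print $1, $3, $4, $5 }'
}

# Sample rows taken from the status output posted in this thread.
printf '%s\n' \
    'NAME STATE READ WRITE CKSUM' \
    'c5t1d0 ONLINE 0 0 0' \
    'c5t2d0 UNAVAIL 12 20 0 cannot open' |
    devices_with_errors
```

With the sample rows this prints `c5t2d0 12 20 0`. On a live system the input would presumably come from `zpool status VOLUME | devices_with_errors`.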
Re: [OpenIndiana-discuss] Replacing an unavailable hd...
From: Jim Klimov
> If you are uncertain if the HDD device has really failed, you can also
> try to take apart the computer and remove-replug the power and signal
> cables, perhaps a few times. This may cleanse them of oxidation and
> repair the storage - happened to me dozens of times on both home-made
> rigs and brand servers (though rarely on the latter).
>
> Also, while you're near the box taken apart, you can listen to the
> disk if it "squeaks" and vibrates when powered on, or no longer works
> mechanically indeed.
>
> In fact, recently we've had a power outage that rebooted a couple of
> old servers who had a dead and unresponsive HDD each (the poor boxes
> waited for replacements to be purchased and received), and now the
> disks are back online - several scrubs found no problems (that is,
> after the initial resilver/scrub which complained a lot due to lots
> of stale data). So there was even no mechanical replugging, just a
> power cycle.

Hum... the server is a bit less than 2 years old, and all disks are plugged on the same back-plane, so I would be surprised if it was a cabling issue. And, since I have spares, I prefer to replace the suspect disk to be safe and test it later... If it is indeed a cabling issue, the new disk will also look failed I guess.

If it helps, cfgadm -alv says:
  sata0/2 disconnected unconfigured failed

I did not yet go onsite to witness if there is any red LED.

Thx,
JD
Re: [OpenIndiana-discuss] Replacing an unavailable hd...
On 2014-06-16 10:18, John Doe via openindiana-discuss wrote:
> Hi, I just got my first failed HD under OI and wanted to be sure I don't crash everything when replacing it...

If you are uncertain if the HDD device has really failed, you can also try to take apart the computer and remove-replug the power and signal cables, perhaps a few times. This may cleanse them of oxidation and repair the storage - happened to me dozens of times on both home-made rigs and brand servers (though rarely on the latter).

Also, while you're near the box taken apart, you can listen to the disk if it "squeaks" and vibrates when powered on, or no longer works mechanically indeed.

In fact, recently we've had a power outage that rebooted a couple of old servers who had a dead and unresponsive HDD each (the poor boxes waited for replacements to be purchased and received), and now the disks are back online - several scrubs found no problems (that is, after the initial resilver/scrub which complained a lot due to lots of stale data). So there was even no mechanical replugging, just a power cycle.

HTH,
//Jim
[OpenIndiana-discuss] Replacing an unavailable hd...
Hi,

I just got my first failed HD under OI and wanted to be sure I don't crash everything when replacing it...

# zpool status -x
  pool: VOLUME
 state: DEGRADED
status: One or more devices could not be opened.  Sufficient replicas exist for
        the pool to continue functioning in a degraded state.
action: Attach the missing device and online it using 'zpool online'.
   see: http://illumos.org/msg/ZFS-8000-2Q
  scan: scrub repaired 0 in 0h0m with 0 errors on Fri Jan 25 14:09:17 2013
config:

        NAME        STATE     READ WRITE CKSUM
        VOLUME      DEGRADED     0     0     0
          raidz2-0  DEGRADED     0     0     0
            c5t0d0  ONLINE       0     0     0
            c5t1d0  ONLINE       0     0     0
            c5t2d0  UNAVAIL     12    20     0  cannot open
            c5t3d0  ONLINE       0     0     0
            c5t4d0  ONLINE       0     0     0
            c5t5d0  ONLINE       0     0     0

Do I just need to physically replace the HD and then do a:
  zpool replace VOLUME c5t2d0
or do I need to encapsulate that with some zpool offline/online, "cfgadm -c unconfigure/configure", followed by a "zpool clear"...?

Thx,
JD
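Editor's sketch, not part of the original message: one possible ordering of the steps asked about above, printed as a checklist rather than executed. It assumes that disk c5t2d0 sits behind attachment point sata0/2 (suggested by the follow-ups, but not confirmed) and that the controller needs an explicit unconfigure/configure around the swap; on hardware with working hotplug some steps may be unnecessary. `print_replace_plan` is a hypothetical helper name.

```shell
#!/bin/sh
# print_replace_plan: print (never run) a hot-swap sequence for review.
# Arguments: pool name, disk name, cfgadm attachment point.
print_replace_plan() {
    pool=$1 disk=$2 ap_id=$3
    echo "zpool offline $pool $disk"
    echo "cfgadm -c unconfigure $ap_id"
    echo "# physically swap the drive"
    echo "cfgadm -c configure $ap_id"
    echo "zpool replace $pool $disk"
    echo "zpool status $pool"
}

# Names taken from this thread; the disk-to-port mapping is an assumption.
print_replace_plan VOLUME c5t2d0 sata0/2
```

The helper only prints because the physical swap has to happen between the unconfigure and configure steps, so running the whole list unattended would not make sense.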