Re: [ceph-users] Power outages!!! help!

Willem Jan Withagen Tue, 29 Aug 2017 12:49:44 -0700

On 29-8-2017 19:12, Steve Taylor wrote:
> Hong,
> 
> Probably your best chance at recovering any data without special,
> expensive, forensic procedures is to perform a dd from /dev/sdb to
> somewhere else large enough to hold a full disk image and attempt to
> repair that. You'll want to use 'conv=noerror' with your dd command
> since your disk is failing. Then you could either re-attach the OSD
> from the new source or attempt to retrieve objects from the filestore
> on it.


As somebody else already pointed out:
in problem cases like this disk, use dd_rescue.
It has a far better chance of producing a usable copy of your disk.
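
For what it's worth, a minimal sketch of such an imaging step. It assumes
GNU ddrescue (a different tool from dd_rescue, but with the same purpose)
and falls back to plain dd when ddrescue is not installed. The function
name image_disk and the default map-file location are made up for
illustration, and the retry count is an arbitrary choice:

```shell
# Hypothetical helper: copy SRC (a failing disk or partition) to DST (an
# image file or block device at least as large), continuing past read errors.
image_disk() {
    src=$1
    dst=$2
    map=${3:-$dst.map}   # ddrescue map file; pass a third argument to override
    if command -v ddrescue >/dev/null 2>&1; then
        # GNU ddrescue: the map file records progress, so an interrupted
        # run can be resumed; -r3 retries bad areas up to three times.
        ddrescue -r3 "$src" "$dst" "$map"
    else
        # Plain dd fallback: noerror keeps going after a read error, and
        # sync pads short reads so later data stays at the right offset.
        dd if="$src" of="$dst" bs=4096 conv=noerror,sync
    fi
}
```

For example: image_disk /dev/sdb /mnt/rescue/sdb.img, where /mnt/rescue is
a placeholder for a filesystem with room for the whole disk.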

--WjW

> I have actually done this before by creating an RBD that matches the
> disk size, performing the dd, running xfs_repair, and eventually
> adding it back to the cluster as an OSD. RBDs as OSDs is certainly a
> temporary arrangement for repair only, but I'm happy to report that it
> worked flawlessly in my case. I was able to weight the OSD to 0,
> offload all of its data, then remove it for a full recovery, at which
> point I just deleted the RBD.
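
The sequence described above could be sketched roughly as follows. This is
a dry run that only prints commands. The pool/image name rbd/rescue-osdN,
the device paths /dev/sdb and /dev/rbd0, and the helper name
plan_rbd_rescue are placeholders, not taken from Steve's setup, and the
choice of "ceph osd crush reweight" for weighting the OSD to 0 is an
assumption:

```shell
# Hypothetical dry-run helper: print (do not run) the commands for imaging
# a failing OSD disk onto an RBD of matching size.
plan_rbd_rescue() {
    disk_bytes=$1   # disk size in bytes, e.g. from: blockdev --getsize64 /dev/sdb
    osd_id=$2       # numeric id of the OSD that lives on the failing disk
    # rbd create takes its size in megabytes by default; round up.
    size_mib=$(( (disk_bytes + 1048575) / 1048576 ))
    echo "rbd create --size ${size_mib} rbd/rescue-osd${osd_id}"
    echo "rbd map rbd/rescue-osd${osd_id}   # prints the mapped device, e.g. /dev/rbd0"
    echo "dd if=/dev/sdb of=/dev/rbd0 bs=64K conv=noerror,sync"
    echo "xfs_repair /dev/rbd0   # or the XFS partition on it, if the whole disk was imaged"
    echo "ceph osd crush reweight osd.${osd_id} 0   # after the OSD is back up, to drain it"
}
```

Running plan_rbd_rescue "$(blockdev --getsize64 /dev/sdb)" 3 prints the
plan for a hypothetical osd.3, so each step can be reviewed before
anything is executed.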
> 
> The possibilities afforded by Ceph inception are endless. ☺
> 
> 
>  
> Steve Taylor | Senior Software Engineer | StorageCraft Technology Corporation
> 380 Data Drive Suite 300 | Draper | Utah | 84020
> Office: 801.871.2799 | 
>  
> If you are not the intended recipient of this message or received it 
> erroneously, please notify the sender and delete it, together with any 
> attachments, and be advised that any dissemination or copying of this message 
> is prohibited.
> 
>  
> 
> On Mon, 2017-08-28 at 23:17 +0100, Tomasz Kusmierz wrote:
>> Rule of thumb with batteries is:
>> - the closer to the “proper temperature” you run them, the more life
>> you get out of them
>> - the more the battery is overpowered for your application, the longer
>> it will survive.
>>
>> Get yourself an LSI 94** controller and use it as an HBA and you will be
>> fine. But get MORE DRIVES !!!!! …
>>> On 28 Aug 2017, at 23:10, hjcho616 <hjcho...@yahoo.com> wrote:
>>>
>>> Thank you Tomasz and Ronny.  I'll have to order some hdd soon and
>>> try these out.  Car battery idea is nice!  I may try that.. =)  Do
>>> they last longer?  Ones that fit the UPS original battery spec
>>> didn't last very long... part of the reason why I gave up on them..
>>> =P  My wife probably won't like the idea of car battery hanging out
>>> though ha!
>>>
>>> The OSD1 (one with mostly ok OSDs, except that smart failure)
>>> motherboard doesn't have any additional SATA connectors available.
>>>  Would it be safe to add another OSD host?
>>>
>>> Regards,
>>> Hong
>>>
>>>
>>>
>>> On Monday, August 28, 2017 4:43 PM, Tomasz Kusmierz
>>> <tom.kusmierz@gmail.com> wrote:
>>>
>>>
>>> Sorry for being brutal … anyway 
>>> 1. get a battery for the UPS (a car battery will do as well; I’ve
>>> modded one UPS in the past with a truck battery and it was working
>>> like a charm :D )
>>> 2. get spare drives and put those in, because your cluster CAN NOT
>>> get out of error due to lack of space
>>> 3. follow the advice of Ronny Aasen on how to recover data from the
>>> hard drives
>>> 4. get cooling for the drives or you will lose more!
>>>
>>>
>>>> On 28 Aug 2017, at 22:39, hjcho616 <hjcho...@yahoo.com> wrote:
>>>>
>>>> Tomasz,
>>>>
>>>> Those machines are behind a surge protector.  Doesn't appear to
>>>> be a good one!  I do have a UPS... but it is my fault... no
>>>> battery.  Power was pretty reliable for a while... and UPS was
>>>> just beeping every chance it had, disrupting some sleep.. =P  So
>>>> running on surge protector only.  I am running this in a home
>>>> environment.   So far, HDD failures have been very rare for this
>>>> environment. =)  It just doesn't get loaded as much!  I am not
>>>> sure what to expect; seeing that "unfound" and just a feeling of
>>>> possibility of maybe getting the OSD back made me excited about it.
>>>> =) Thanks for letting me know what should be the priority.  I
>>>> just lack experience and knowledge in this. =) Please do continue
>>>> to guide me through this.
>>>>
>>>> Thank you for decoding those SMART messages!  I do agree that it
>>>> looks like it is on its way out.  I would like to know how to get a
>>>> good portion of it back if possible. =)
>>>>
>>>> I think I just set the size and min_size to 1.
>>>> # ceph osd lspools
>>>> 0 data,1 metadata,2 rbd,
>>>> # ceph osd pool set rbd size 1
>>>> set pool 2 size to 1
>>>> # ceph osd pool set rbd min_size 1
>>>> set pool 2 min_size to 1
>>>>
>>>> Seems to be doing some backfilling work.
>>>>
>>>> # ceph health
>>>> HEALTH_ERR 22 pgs are stuck inactive for more than 300 seconds; 2
>>>> pgs backfill_toofull; 74 pgs backfill_wait; 3 pgs backfilling;
>>>> 108 pgs degraded; 6 pgs down; 6 pgs inconsistent; 6 pgs peering;
>>>> 7 pgs recovery_wait; 16 pgs stale; 108 pgs stuck degraded; 6 pgs
>>>> stuck inactive; 16 pgs stuck stale; 130 pgs stuck unclean; 101
>>>> pgs stuck undersized; 101 pgs undersized; 1 requests are blocked
>>>> > 32 sec; recovery 1790657/4502340 objects degraded (39.772%);
>>>> recovery 641906/4502340 objects misplaced (14.257%); recovery
>>>> 147/2251990 unfound (0.007%); 50 scrub errors; mds cluster is
>>>> degraded; no legacy OSD present but 'sortbitwise' flag is not set
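
To keep an eye on the unfound count while backfilling runs, something like
the following could be used. A rough sketch only: unfound_count is a
made-up helper, and it assumes the health summary keeps the "N/M unfound"
wording shown in the output above.

```shell
# Hypothetical helper: read `ceph health` output on stdin and print the
# number of unfound objects (prints 0 when none are reported).
unfound_count() {
    awk '{
        # Look for the "<unfound>/<total> unfound" fragment of the summary.
        if (match($0, /[0-9]+\/[0-9]+ unfound/)) {
            split(substr($0, RSTART, RLENGTH), parts, "/")
            print parts[1]
            found = 1
            exit
        }
    } END { if (!found) print 0 }'
}
```

Typical use would be: ceph health | unfound_count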
>>>>
>>>>
>>>>
>>>> Regards,
>>>> Hong
>>>>
>>>>
>>>> On Monday, August 28, 2017 4:18 PM, Tomasz Kusmierz
>>>> <tom.kusmierz@gmail.com> wrote:
>>>>
>>>>
>>>> So to decode a few things about your disk:
>>>>
>>>>   1 Raw_Read_Error_Rate    0x002f  100  100  051    Pre-fail  Always      -      37
>>>> 37 read errors and only one sector marked as pending - fun disk :/
>>>>
>>>> 181 Program_Fail_Cnt_Total  0x0022  099  099  000    Old_age  Always      -      35325174
>>>> So the firmware has quite a few bugs, that’s nice
>>>>
>>>> 191 G-Sense_Error_Rate      0x0022  100  100  000    Old_age  Always      -      2855
>>>> The disk was thrown around while operational, even nicer.
>>>>
>>>> 194 Temperature_Celsius    0x0002  047  041  000    Old_age  Always      -      53 (Min/Max 15/59)
>>>> If your disk passes 50 °C you should not consider using it; high
>>>> temperatures demagnetise the platter layer and you will see more
>>>> errors in the very near future.
>>>>
>>>> 197 Current_Pending_Sector  0x0032  100  100  000    Old_age  Always      -      1
>>>> as mentioned before :)
>>>>
>>>> 200 Multi_Zone_Error_Rate  0x002a  100  100  000    Old_age  Always      -      4222
>>>> your heads keep missing tracks … bent ? I don’t even know how to
>>>> comment here.
>>>>
>>>>
>>>> Generally, a fun drive you’ve got there … rescue as much as you can
>>>> and throw it away !!!
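
The two checks from the decode above that are easiest to automate
(temperature and pending sectors) could be scripted along these lines. A
sketch under assumptions: smart_flags is a made-up name, the 50 degree
limit is just the rule of thumb quoted above, and the column positions
assume the usual `smartctl -A` table layout with the raw value in the
tenth field.

```shell
# Hypothetical helper: read `smartctl -A` attribute rows on stdin and flag
# a high temperature (ID 194) or any pending sectors (ID 197).
# Optional first argument: temperature limit in degrees Celsius (default 50).
smart_flags() {
    awk -v max_temp="${1:-50}" '
        # Field 1 is the attribute ID, field 10 the raw value.
        $1 == 194 && $10 + 0 > max_temp + 0 {
            print "temperature " $10 "C exceeds " max_temp "C"
        }
        $1 == 197 && $10 + 0 > 0 {
            print $10 " pending sector(s)"
        }
    '
}
```

Typical use would be: smartctl -A /dev/sdb | smart_flags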
>>>>
>>>>
>>>
>>>
> _______________________________________________
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 
