Just to circle back to this:

Drives: Seagate ST8000NM0065
Controller: LSI 3108 RAID-on-Chip
At the time, there was no BBU on the RoC controller.
Each OSD drive was configured as a single RAID0 VD.

What I believe to be the snake that bit us was the Seagate drives’ on-board 
write cache: it is volatile, so writes the drive had already acknowledged 
were lost when power dropped.

We manage the controller and drives with storcli, and the pdcache value for 
/cx/vx was left at default, which in this case is on.

So now all of the VDs have the pdcache value set to off.

At the time, the controller’s write-cache setting was also set to write-back; 
it has since been set to write-through until BBUs are installed.
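
For anyone wanting to make the same two changes, here is a rough sketch of the 
storcli calls involved. The controller index (/c0) and the storcli64 path are 
taken from the output in this message; the APPLY variable and the 
safe_cache_cmds helper are just names made up for this example. By default it 
only prints what it would run, so check it against your own setup first.

```shell
#!/bin/sh
# Sketch only: controller index and storcli64 path come from the output in
# this message; adjust both for your own system.
STORCLI="${STORCLI:-/opt/MegaRAID/storcli/storcli64}"
CTRL="${CTRL:-/c0}"

# Emit the two policy changes described above, applied to every VD (vall).
safe_cache_cmds() {
    # Turn off the drives' own on-board cache.
    echo "$STORCLI $CTRL/vall set pdcache=off"
    # Keep the controller cache in write-through until a BBU is installed.
    echo "$STORCLI $CTRL/vall set wrcache=wt"
}

# Print by default; set APPLY=1 (hypothetical switch) to really run them.
safe_cache_cmds | while read -r cmd; do
    if [ "${APPLY:-0}" = "1" ]; then
        sudo $cmd
    else
        echo "would run: $cmd"
    fi
done
```

Running it once without APPLY set shows the exact commands before anything on 
the controller is touched.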

Below is an example of our current settings in use post power-event:

> $ sudo /opt/MegaRAID/storcli/storcli64 /c0/v0 show all
> Controller = 0
> Status = Success
> Description = None
> 
> 
> /c0/v0 :
> ======
> 
> --------------------------------------------------------------
> DG/VD TYPE  State Access Consist Cache Cac sCC     Size Name
> --------------------------------------------------------------
> 0/0   RAID0 Optl  RW     Yes     RWTD  -   ON  7.276 TB ceph1
> --------------------------------------------------------------
> 
> Cac=CacheCade|Rec=Recovery|OfLn=OffLine|Pdgd=Partially Degraded|dgrd=Degraded
> Optl=Optimal|RO=Read Only|RW=Read Write|HD=Hidden|TRANS=TransportReady|B=Blocked|
> Consist=Consistent|R=Read Ahead Always|NR=No Read Ahead|WB=WriteBack|
> AWB=Always WriteBack|WT=WriteThrough|C=Cached IO|D=Direct IO|sCC=Scheduled
> Check Consistency
> 
> 
> PDs for VD 0 :
> ============
> 
> -----------------------------------------------------------------------
> EID:Slt DID State DG     Size Intf Med SED PI SeSz Model            Sp
> -----------------------------------------------------------------------
> 252:0     9 Onln   0 7.276 TB SAS  HDD N   N  4 KB ST8000NM0065     U
> -----------------------------------------------------------------------
> 
> EID-Enclosure Device ID|Slt-Slot No.|DID-Device ID|DG-DriveGroup
> DHS-Dedicated Hot Spare|UGood-Unconfigured Good|GHS-Global Hotspare
> UBad-Unconfigured Bad|Onln-Online|Offln-Offline|Intf-Interface
> Med-Media Type|SED-Self Encryptive Drive|PI-Protection Info
> SeSz-Sector Size|Sp-Spun|U-Up|D-Down|T-Transition|F-Foreign
> UGUnsp-Unsupported|UGShld-UnConfigured shielded|HSPShld-Hotspare shielded
> CFShld-Configured shielded|Cpybck-CopyBack|CBShld-Copyback Shielded
> 
> 
> VD0 Properties :
> ==============
> Strip Size = 256 KB
> Number of Blocks = 1953374208
> VD has Emulated PD = No
> Span Depth = 1
> Number of Drives Per Span = 1
> Write Cache(initial setting) = WriteThrough
> Disk Cache Policy = Disabled
> Encryption = None
> Data Protection = Disabled
> Active Operations = None
> Exposed to OS = Yes
> Creation Date = 17-06-2016
> Creation Time = 02:49:02 PM
> Emulation type = default
> Cachebypass size = Cachebypass-64k
> Cachebypass Mode = Cachebypass Intelligent
> Is LD Ready for OS Requests = Yes
> SCSI NAA Id = 600304801bb4c0001ef6ca5ea0fcb283
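
To check a VD without reading the whole dump by eye, something like the 
following could be used. It is only a sketch: check_vd_cache is a made-up 
helper name, it relies on the two field names exactly as they appear in the 
output above, and it treats anything other than WriteThrough plus Disabled as 
unsafe, which is only the right rule while there is no BBU.

```shell
#!/bin/sh
# Reads `storcli64 /cN/vN show all` output on stdin and reports whether the
# two cache settings discussed here are in their no-BBU-safe state.
# Exit status: 0 if both match, 1 otherwise.
check_vd_cache() {
    awk -F' = ' '
        # Strip trailing whitespace or CRs so the comparisons below are exact.
        { sub(/[ \t\r]+$/, "", $2) }
        /^Write Cache\(initial setting\)/ { wc = $2 }
        /^Disk Cache Policy/              { dc = $2 }
        END {
            ok = (wc == "WriteThrough" && dc == "Disabled")
            printf "write cache: %s, disk cache: %s -> %s\n", wc, dc, (ok ? "OK" : "UNSAFE")
            exit (ok ? 0 : 1)
        }'
}

# Example against a live controller (path and indices as in this message):
#   sudo /opt/MegaRAID/storcli/storcli64 /c0/v0 show all | check_vd_cache
```

Looping that over each VD gives a quick pass/fail per OSD drive.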


Hopefully this is a much safer configuration, and sharing it helps someone 
else avoid the same kind of destructive issues.

The only less-than-great part of this configuration is the hit to write I/O, 
since write-through leaves less room for optimal write scheduling than cached 
writes do. I hope to enable write-back at the controller level after BBU 
installation.

Thanks,

Reed

> On Sep 1, 2016, at 6:21 AM, Cloud List <cloud-l...@sg.or.id> wrote:
> 
> 
> 
> On Thu, Sep 1, 2016 at 3:50 PM, Nick Fisk <n...@fisk.me.uk 
> <mailto:n...@fisk.me.uk>> wrote:
> > > On 31 August 2016 at 23:21, Reed Dier <reed.d...@focusvq.com 
> > > <mailto:reed.d...@focusvq.com>> wrote:
> > >
> > >
> > > Multiple XFS corruptions, multiple leveldb issues. Looked to be result of 
> > > write cache settings which have been adjusted now.
> 
> Reed, I realise that you are probably very busy attempting recovery at the 
> moment, but when things calm down, I think it would be very beneficial to the 
> list if you could expand on what settings caused this to happen. It might 
> just stop this happening to someone else in the future.
> 
> Agree with Nick. When things settle down and (hopefully) all the data is 
> recovered, I would appreciate it if Reed could share what kind of write 
> cache settings can cause this problem and what adjustment was made to 
> prevent it from happening again.
> 
> Thank you.
> 
> -ip-

_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
