[ceph-users] Many concurrent drive failures - How do I activate pgs?
HEALTH_WARN noout flag(s) set; Reduced data availability: 3 pgs inactive, 3 pgs down; Degraded data redundancy: 3 pgs unclean
OSDMAP_FLAGS noout flag(s) set
PG_AVAILABILITY Reduced data availability: 3 pgs inactive, 3 pgs down
    pg 4.43 is down, acting [4,15,18]
    pg 7.4 is down, acting [8,5,21,18,15,0]
    pg 7.f is down, acting [23,0,16,5,11,14]
PG_DEGRADED Degraded data redundancy: 3 pgs unclean
    pg 4.43 is stuck unclean since forever, current state down, last acting [4,15,18]
    pg 7.4 is stuck unclean since forever, current state down, last acting [8,5,21,18,15,0]
    pg 7.f is stuck unclean since forever, current state down, last acting [23,0,16,5,11,14]

Original 'ceph pg X query' status (before we mucked around by exporting and deleting OSDs): https://pastebin.com/fBQhq6UQ
Current 'ceph pg X query' status (after recreating temporary OSDs with the original IDs and importing the exports): https://pastebin.com/qcN5uYkN

What we assume needs to be done:
- Tell Ceph that the OSDs are lost (query status in the pastebin above reports 'starting or marking this osd lost may let us proceed'). We have stopped, marked the temporary OSDs as out and run 'ceph osd lost $ID --yes-i-really-mean-it' already though.
- Somehow get Ceph to forget about the sharded objects it doesn't have sufficient pieces of.
- Copy the images to another pool so that we can get pieces of data off these and rebuild those systems.
- Hopefully get Intel to unlock the drives, export as much of the content as possible and import the various exports so that we can ultimately copy off complete images.

Really, really hoping to have a Merry Christmas... ;)

PS: We got the 80 GB image out, it had a single 4MB object hole so we used ddrescue to read the source image forwards, rebooted the node when it stalled on the missing data and repeated the copy in reverse direction thereafter...

Regards
David Herselman

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
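For the record, a rough sketch of the commands the steps above map to. The PG, pool and image names here are placeholders, and mark_unfound_lost permanently discards data, so treat this as an outline rather than a recipe:

# See which objects a down/incomplete PG cannot recover:
ceph pg 7.4 query | less

# Tell Ceph to give up on objects it can no longer reconstruct (irreversible):
ceph pg 7.4 mark_unfound_lost delete

# Copy what remains of an image into another pool, e.g. via qemu-img as used further down the thread:
qemu-img convert -f raw -O raw rbd:rbd_ssd/vm-100-disk-1 rbd:rbd_hdd/vm-100-disk-1_rescue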
Re: [ceph-users] Many concurrent drive failures - How do I activate pgs?
pported : True SMBusAddress : Selected drive is in a disable logical state. SSCEnabled : False SanitizeBlockEraseSupported : False SanitizeCryptoScrambleSupported : True SanitizeSupported : True SataGen1 : True SataGen2 : True SataGen3 : True SataNegotiatedSpeed : Unknown SectorSize : 512 SecurityEnabled : False SecurityFrozen : False SecurityLocked : False SecuritySupported : False SerialNumber : PHYM7276031E1P9DGN TCGSupported : False TargetID : 0 TempThreshold : Selected drive is in a disable logical state. TemperatureLoggingInterval : Selected drive is in a disable logical state. TimeLimitedErrorRecovery : Selected drive is in a disable logical state. TrimSize : 4 TrimSupported : True VolatileWriteCacheEnabled : Selected drive is in a disable logical state. WWID : 3959312879584368077 WriteAtomicityDisableNormal : Selected drive is in a disable logical state. WriteCacheEnabled : True WriteCacheReorderingStateEnabled : Selected drive is in a disable logical state. WriteCacheState : Selected drive is in a disable logical state. WriteCacheSupported : True WriteErrorRecoveryTimer : Selected drive is in a disable logical state. SMART information is inaccessible, overall status is failed. Herewith the stats from a partner disc which was still working when the others failed: Device Model: INTEL SSDSC2KG019T7 Serial Number:PHYM727602TM1P9DGN LU WWN Device Id: 5 5cd2e4 14e1636bb Firmware Version: SCV10100 User Capacity:1,920,383,410,176 bytes [1.92 TB] Sector Sizes: 512 bytes logical, 4096 bytes physical Rotation Rate:Solid State Device Form Factor: 2.5 inches Device is:Not in smartctl database [for details use: -P showall] ATA Version is: ACS-3 T13/2161-D revision 5 SATA Version is: SATA 3.2, 6.0 Gb/s (current: 6.0 Gb/s) Local Time is:Mon Dec 18 19:33:51 2017 SAST SMART support is: Available - device has SMART capability. 
SMART support is: Enabled

ID# ATTRIBUTE_NAME          FLAG   VALUE WORST THRESH TYPE     UPDATED WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0032 100   100   000    Old_age  Always  -           0
  9 Power_On_Hours          0x0032 100   100   000    Old_age  Always  -           98
 12 Power_Cycle_Count       0x0032 100   100   000    Old_age  Always  -           3
170 Unknown_Attribute       0x0033 100   100   010    Pre-fail Always  -           0
171 Unknown_Attribute       0x0032 100   100   000    Old_age  Always  -           1
172 Unknown_Attribute       0x0032 100   100   000    Old_age  Always  -           0
174 Unknown_Attribute       0x0032 100   100   000    Old_age  Always  -           0
175 Program_Fail_Count_Chip 0x0033 100   100   010    Pre-fail Always  -           17567121432
183 Runtime_Bad_Block       0x0032 100   100   000    Old_age  Always  -           0
184 End-to-End_Error        0x0033 100   100   090    Pre-fail Always  -           0
187 Reported_Uncorrect      0x0032 100   100   000    Old_age  Always  -           0
190 Airflow_Temperature_Cel 0x0022 077   076   000    Old_age  Always  -           23 (Min/Max 17/29)
192 Power-Off_Retract_Count 0x0032 100   100   000    Old_age  Always  -           0
194 Temperature_Celsius     0x0022 100   100   000    Old_age  Always  -           23
197 Current_Pending_Sector  0x0012 100   100   000    Old_age  Always  -           0
199 UDMA_CRC_Error_Count    0x003e 100   100   000    Old_age  Always  -           0
225 Unknown_SSD_Attribute   0x0032 100   100   000    Old_age  Always  -           14195
226 Unknown_SSD_Attribute   0x0032 100   100   000    Old_age  Always  -           0
227 Unknown_SSD_Attribute   0x0032 100   100   000    Old_age  Always  -           42
228 Power-off_Retract_Count 0x0032 100   100   000    Old_age  Always  -           5905
232 Available_Reservd_Space 0x0033 100   100   010    Pre-fail Always  -           0
233 Media_Wearout_Indicator 0x0032 100   100   000    Old_age  Always  -           0
234 Unknown_Attribute       0x0032 100   100   000    Old_age  Always  -           0
241 Total_LBAs_Written      0x0032 100   100   000    Old_age  Always  -           14195
242 Total_LBAs_Read         0x0032 100   100   000    Old_age  Always  -           10422
243 Unknown_Attribute       0x0032 100   100   000    Old_age  Always  -           41906

Media wear out: 0% used
LBAs written: 14195
Power on hours: <100
Power cycle count: once at the factory, once at our offices to check if there was newer firmware (there wasn't) and once when we restarted the node to see if it could then access a failed drive.

Regards
David Herselman

-Original Message-
From: Christian Balzer [mailto:ch...@gol.com]
Sent: Thursday, 21 December 2017 3:24 AM
To: ceph-users@lists.ceph.com
Cc: David Herselman
Subject: Re: [ceph-users] Many concurrent drive failures - How do I activate pgs?

Hello,

first off, I don't have anything to add to your co
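For reference, the drive details quoted above were presumably gathered along these lines; the device path and drive index are examples only, and this assumes smartmontools and Intel's SSD Data Center Tool (isdct) are installed:

# SMART identity and attribute table for a SATA SSD:
smartctl -a /dev/sdb

# Intel's isdct produces the firmware/property output shown further up;
# list the drives first, then dump all properties for one of them:
isdct show -intelssd
isdct show -a -intelssd 0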
Re: [ceph-users] Many concurrent drive failures - How do I activate pgs?
covery/osd32_7.fs2.export systemctl stop ceph-osd@34 ceph-objectstore-tool --op import --pgid 7.4s5 --data-path /var/lib/ceph/osd/ceph-34 --journal-path /var/lib/ceph/osd/ceph-34/journal --file /var/lib/vz/template/ssd_recovery/osd34_7.4s5.export systemctl stop ceph-osd@34 ceph-objectstore-tool --op import --pgid 7.fs1 --data-path /var/lib/ceph/osd/ceph-34 --journal-path /var/lib/ceph/osd/ceph-34/journal --file /var/lib/vz/template/ssd_recovery/osd34_7.fs1.export Reset permissions and then started the OSDs: for OSD in 27 30 31 32 34; do chown -R ceph:ceph /var/lib/ceph/osd/ceph-$OSD; systemctl start ceph-osd@$OSD; done Then finally started all the OSDs... Now to hope that Intel have a way of accessing drives that are in a 'disable logical state'. The imports succeed, herewith a link to the output after running an import for placement group 7.4s2 on OSD 31: https://drive.google.com/open?id=1-Jo1jmrWrGLO2OgflacGPlEf2p32Y4hn Sample snippet: Write 1#7:fffcd2ec:::rbd_data.4.be8e9974b0dc51.2869:head# snapset 0=[]:{} Write 1#7:fffd4823:::rbd_data.4.ba24ef2ae8944a.a2b0:head# snapset 0=[]:{} Write 1#7:fffd6fb6:::benchmark_data_kvm5b_20945_object14722:head# snapset 0=[]:{} Write 1#7:a069:::rbd_data.4.ba24ef2ae8944a.aea9:head# snapset 0=[]:{} Import successful Data does get written, I can tell by the size of the FileStore mount points: [root@kvm5b ssd_recovery]# df -h | grep -P 'ceph-(27|30|31|32|34)$' /dev/sdd4 140G 5.2G 135G 4% /var/lib/ceph/osd/ceph-27 /dev/sdd7 140G 14G 127G 10% /var/lib/ceph/osd/ceph-30 /dev/sdd8 140G 14G 127G 10% /var/lib/ceph/osd/ceph-31 /dev/sdd9 140G 22G 119G 16% /var/lib/ceph/osd/ceph-32 /dev/sdd11 140G 22G 119G 16% /var/lib/ceph/osd/ceph-34 How do I tell Ceph to read these object shards? PS: It's probably a good idea to reweight the OSDs to 0 before starting again. This should prevent data flowing on to them, if they are not in a different device class or other crush selection ruleset. Ie: for OSD in `seq 24 35`; do ceph osd crush reweight osd.$OSD 0; done Regards David Herselman -Original Message- From: David Herselman Sent: Thursday, 21 December 2017 3:49 AM To: 'Christian Balzer' ; ceph-users@lists.ceph.com Subject: RE: [ceph-users] Many concurrent drive failures - How do I activate pgs? Hi Christian, Thanks for taking the time, I haven't been contacted by anyone yet but managed to get the down placement groups cleared by exporting 7.4s0 and 7.fs0 and then marking them as complete on the surviving OSDs: kvm5c: ceph-objectstore-tool --op export --pgid 7.4s0 --data-path /var/lib/ceph/osd/ceph-8 --journal-path /var/lib/ceph/osd/ceph-8/journal --file /var/lib/vz/template/ssd_recovery/osd8_7.4s0.export; ceph-objectstore-tool --op mark-complete --data-path /var/lib/ceph/osd/ceph-8 --journal-path /var/lib/ceph/osd/ceph-8/journal --pgid 7.4s0; kvm5f: ceph-objectstore-tool --op export --pgid 7.fs0 --data-path /var/lib/ceph/osd/ceph-23 --journal-path /var/lib/ceph/osd/ceph-23/journal --file /var/lib/vz/template/ssd_recovery/osd23_7.fs0.export; ceph-objectstore-tool --op mark-complete --data-path /var/lib/ceph/osd/ceph-23 --journal-path /var/lib/ceph/osd/ceph-23/journal --pgid 7.fs0; This would presumably simply punch holes in the RBD images but at least we can copy them out of that pool and hope that Intel can somehow unlock the drives for us to then export/import objects. To answer your questions though, we have 6 near identical Intel Wildcat Pass 1U servers and have Proxmox loaded on them. 
Proxmox uses a Debian 9 base with the Ubuntu kernel, for which they apply cherry picked kernel patches (eg Intel NIC driver updates, vhost perf regression and mem-leak fixes, etc): kvm5a: Intel R1208WTTGSR System (serial: BQWS55091014) Intel S2600WTTR Motherboard (serial: BQWL54950385, BIOS ID: SE5C610.86B.01.01.0021.032120170601) 2 x Intel Xeon E5-2640v4 2.4GHz (HT disabled) 24 x Micron 8GB DDR4 2133MHz (24 x 18ASF1G72PZ-2G1B1) Intel AXX10GBNIA I/O Module kvm5b: Intel R1208WTTGS System (serial: BQWS53890178) Intel S2600WTT Motherboard (serial: BQWL52550359, BIOS ID: SE5C610.86B.01.01.0021.032120170601) 2 x Intel Xeon E5-2640v4 2.4GHz (HT enabled) 4 x Micron 64GB DDR4 2400MHz LR-DIMM (4 x 72ASS8G72LZ-2G3B2) Intel AXX10GBNIA I/O Module kvm5c: Intel R1208WT2GS System (serial: BQWS50490279) Intel S2600WT2 Motherboard (serial: BQWL44650203, BIOS ID: SE5C610.86B.01.01.0021.032120170601) 2 x Intel Xeon E5-2640v3 2.6GHz (HT enabled) 4 x Micron 64GB DDR4 2400MHz LR-DIMM (4 x 72ASS8G72LZ-2G3B2) Intel AXX10GBNIA I/O Module kvm5d: Intel R1208WTTGSR System (serial: BQWS62291318) Intel S2600WTTR Motherboard (serial: BQW
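Coming back to the question above about getting Ceph to read the imported shards, a rough sketch of how to confirm what each recreated OSD now holds. The OSD number and PG ID mirror the import example above, and ceph-objectstore-tool needs the OSD stopped while it has the store open:

# List the objects a given PG shard now contains on the recreated OSD:
systemctl stop ceph-osd@31
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-31 --journal-path /var/lib/ceph/osd/ceph-31/journal --op list --pgid 7.4s2
systemctl start ceph-osd@31

# Ask Ceph how it currently sees the placement groups:
ceph pg 7.4 query | less
ceph pg dump_stuck unclean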
[ceph-users] Copy locked parent and clones to another pool
;rbd_ssd'; dst='rbd_hdd';

rbdsnap () { [ "x" = "$1"x ] && return 1; [ `rbd snap ls $1 | grep -Pc "^\s+\d+\s+$2\s"` -gt 0 ] && return 0 || return 1; }

# Backup 'template-debian-9.3' (clone parent) - Should never change so no need to maintain snapshots or run it on a continual basis:
#for ID in 210; do
#  for f in 1 2 3; do
#    echo -en "\t\t : Copying "$src"/base-"$ID"-disk-"$f"@__base__ to "$dst"/vm-"$ID"-disk-"$f"_backup";
#    qemu-img convert -f raw -O raw -t unsafe -T unsafe -nWp -S 4M rbd:"$src"/base-"$ID"-disk-"$f"@__base__ rbd:"$dst"/vm-"$ID"-disk-"$f"_backup;
#  done
#done

# Backup images (clone children):
for ID in 211 212 213 214; do
  for f in 1 2 3; do
    rbdsnap "$dst"/vm-"$ID"-disk-"$f"_backup snap3 && rbdsnap "$dst"/vm-"$ID"-disk-"$f"_backup snap2 && rbd snap rm "$dst"/vm-"$ID"-disk-"$f"_backup@snap3;
    rbdsnap "$dst"/vm-"$ID"-disk-"$f"_backup snap3 || rbdsnap "$dst"/vm-"$ID"-disk-"$f"_backup snap2 && rbd snap rename "$dst"/vm-"$ID"-disk-"$f"_backup@snap2 "$dst"/vm-"$ID"-disk-"$f"_backup@snap3;
    rbdsnap "$dst"/vm-"$ID"-disk-"$f"_backup snap2 || rbdsnap "$dst"/vm-"$ID"-disk-"$f"_backup snap1 && rbd snap rename "$dst"/vm-"$ID"-disk-"$f"_backup@snap1 "$dst"/vm-"$ID"-disk-"$f"_backup@snap2;
    rbdsnap "$dst"/vm-"$ID"-disk-"$f"_backup snap1 || rbd snap create "$dst"/vm-"$ID"-disk-"$f"_backup@snap1;
    rbd snap create "$src"/vm-"$ID"-disk-"$f"@backupinprogress;
  done
  for f in 1 2 3; do
    echo -en "\t\t : Copying "$src"/vm-"$ID"-disk-"$f" to "$dst"/vm-"$ID"-disk-"$f"_backup";
    #qemu-img convert -f raw -O raw -t unsafe -T unsafe -nWp -S 4M rbd:"$src"/vm-"$ID"-disk-"$f"@backupinprogress rbd:"$dst"/vm-"$ID"-disk-"$f"_backup;
    export dev1=`rbd map "$src"/vm-"$ID"-disk-"$f@backupinprogress" --name client.admin -k /etc/pve/priv/ceph.client.admin.keyring`;
    export dev2=`rbd map "$dst"/vm-"$ID"-disk-"$f"_backup --name client.admin -k /etc/pve/priv/ceph.client.admin.keyring`;
    perl -'MDigest::MD5 md5' -ne 'BEGIN{$/=\4194304};print md5($_)' $dev2 | perl -'MDigest::MD5 md5' -ne 'BEGIN{$/=\4194304};$b=md5($_); read STDIN,$a,16;if ($a eq $b) {print "s"} else {print "c" . $_}' $dev1 | perl -ne 'BEGIN{$/=\1} if ($_ eq"s") {$s++} else {if ($s) { seek STDOUT,$s*4194304,1; $s=0}; read ARGV,$buf,4194304; print $buf}' 1<> $dev2;
    rbd unmap $dev1;
    rbd unmap $dev2;
    rbd snap rm "$src"/vm-"$ID"-disk-"$f"@backupinprogress;
  done
done

Commenting out everything from 'export dev1' to 'rbd unmap $dev2' and uncommenting the qemu-img command yields the following:

real    0m48.598s
user    0m14.583s
sys     0m10.986s

[admin@kvm5a ~]# rbd du rbd_hdd/vm-211-disk-3_backup
NAME                       PROVISIONED   USED
vm-211-disk-3_backup@snap3      20480M  2764M
vm-211-disk-3_backup@snap2      20480M  2764M
vm-211-disk-3_backup@snap1      20480M  2764M
vm-211-disk-3_backup            20480M  2764M
                                20480M 11056M

Repeating the copy using the Perl solution is much slower but as the VM is currently off nothing has changed and each snapshot consumes zero data:

real    1m49.000s
user    1m34.339s
sys     0m17.847s

[admin@kvm5a ~]# rbd du rbd_hdd/vm-211-disk-3_backup
warning: fast-diff map is not enabled for vm-211-disk-3_backup. operation may be slow.
NAME                       PROVISIONED   USED
vm-211-disk-3_backup@snap3      20480M  2764M
vm-211-disk-3_backup@snap2      20480M      0
vm-211-disk-3_backup@snap1      20480M      0
vm-211-disk-3_backup            20480M      0
                                20480M  2764M

PS: Not sure whether this is a Ceph display bug, but why would the snapshot base be reported as not consuming any data and the first snapshot (rotated to 'snap3') report all the usage?
Purging all snapshots yields the following: [admin@kvm5a ~]# rbd du rbd_hdd/vm-211-disk-3_backup warning: fast-diff map is not enabled for vm-211-disk-3_backup. operation may be slow. NAME PROVISIONED USED vm-211-disk-3_backup 20480M 2764M Regards David Herselman ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
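Regarding the fast-diff warning above: if 'rbd du' is to stay fast on the backup images, the missing features can presumably be enabled after the fact and the object map rebuilt, along these lines. Note that snapshots taken before the features were enabled may still lack object maps, and exclusive-lock must already be enabled on the image (enable it first if it is not):

rbd feature enable rbd_hdd/vm-211-disk-3_backup object-map fast-diff
rbd object-map rebuild rbd_hdd/vm-211-disk-3_backup
rbd du rbd_hdd/vm-211-disk-3_backup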
Re: [ceph-users] Running Jewel and Luminous mixed for a longer period
Hi Travis,

In my experience, after converting OSDs from hdd FileStore with ssd journals to hdd BlueStore with RocksDB and its WAL on ssd, FileStore is significantly faster.
https://forum.proxmox.com/threads/ceph-bluestore-not-always-faster-than-filestore.38405/

Pure ssd OSDs would however be much faster using BlueStore...

Regards
David Herselman

On 29 Dec 2017 22:06, Travis Nielsen wrote:
Since bluestore was declared stable in Luminous, is there any remaining scenario to use filestore in new deployments? Or is it safe to assume that bluestore is always better to use in Luminous? All documentation I can find points to bluestore being superior in all cases.

Thanks,
Travis

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
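For context, the two layouts being compared are typically created along these lines with ceph-volume; the device paths are examples only:

# BlueStore OSD with data on a spinner and RocksDB (plus its WAL) on an SSD partition;
# the WAL lives inside the DB device unless split out separately:
ceph-volume lvm create --bluestore --data /dev/sdb --block.db /dev/nvme0n1p1

# The FileStore equivalent, with the journal on the SSD partition:
ceph-volume lvm create --filestore --data /dev/sdb --journal /dev/nvme0n1p2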
Re: [ceph-users] Many concurrent drive failures - How do I activate pgs?
Hi Sean,

No, Intel’s feedback has been… Pathetic… I have yet to receive anything more than a request to ‘sign’ a non-disclosure agreement, to obtain beta firmware. No official answer as to whether or not one can logically unlock the drives, no answer to my question whether or not Intel publish serial numbers anywhere pertaining to recalled batches and no information pertaining to whether or not firmware updates would address any known issues. This with us being an accredited Intel Gold partner…

We’ve returned the lot and ended up with 9/12 of the drives failing in the same manner. The replaced drives, which had different serial number ranges, also failed. Very frustrating is that the drives fail in a way that results in unbootable servers, unless one adds ‘rootdelay=240’ to the kernel.

I would be interested to know what platform your drives were in and whether or not they were connected to a RAID module/card.

PS: After much searching we’ve decided to order the NVMe conversion kit and have ordered HGST UltraStar SN200 2.5 inch SFF drives with a 3 DWPD rating.

Regards
David Herselman

From: Sean Redmond [mailto:sean.redmo...@gmail.com]
Sent: Thursday, 11 January 2018 12:45 AM
To: David Herselman
Cc: Christian Balzer ; ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Many concurrent drive failures - How do I activate pgs?

Hi,

I have a case where 3 out of 12 of these Intel S4600 2TB model failed within a matter of days after being burn-in tested then placed into production. I am interested to know, did you ever get any further feedback from the vendor on your issue?

Thanks

On Thu, Dec 21, 2017 at 1:38 PM, David Herselman <d...@syrex.co> wrote:
Hi,

I assume this can only be a physical manufacturing flaw or a firmware bug? Do Intel publish advisories on recalled equipment? Should others be concerned about using Intel DC S4600 SSD drives? Could this be an electrical issue on the Hot Swap Backplane or BMC firmware issue? Either way, all pure Intel...

The hole is only 1.3 GB (4 MB x 339 objects) but perfectly striped through images, file systems are subsequently severely damaged. Is it possible to get Ceph to read in partial data shards? It would provide between 25-75% more yield...

Is there anything wrong with how we've proceeded thus far? Would be nice to reference examples of using ceph-objectstore-tool but documentation is virtually non-existent.

We used another SSD drive to simulate bringing all the SSDs back online.
We carved up the drive to provide equal partitions to essentially simulate the original SSDs: # Partition a drive to provide 12 x 150GB partitions, eg: sdd 8:48 0 1.8T 0 disk |-sdd18:49 0 140G 0 part |-sdd28:50 0 140G 0 part |-sdd38:51 0 140G 0 part |-sdd48:52 0 140G 0 part |-sdd58:53 0 140G 0 part |-sdd68:54 0 140G 0 part |-sdd78:55 0 140G 0 part |-sdd88:56 0 140G 0 part |-sdd98:57 0 140G 0 part |-sdd10 8:58 0 140G 0 part |-sdd11 8:59 0 140G 0 part +-sdd12 8:60 0 140G 0 part Pre-requisites: ceph osd set noout; apt-get install uuid-runtime; for ID in `seq 24 35`; do UUID=`uuidgen`; OSD_SECRET=`ceph-authtool --gen-print-key`; DEVICE='/dev/sdd'$[$ID-23]; # 24-23 = /dev/sdd1, 35-23 = /dev/sdd12 echo "{\"cephx_secret\": \"$OSD_SECRET\"}" | ceph osd new $UUID $ID -i - -n client.bootstrap-osd -k /var/lib/ceph/bootstrap-osd/ceph.keyring; mkdir /var/lib/ceph/osd/ceph-$ID; mkfs.xfs $DEVICE; mount $DEVICE /var/lib/ceph/osd/ceph-$ID; ceph-authtool --create-keyring /var/lib/ceph/osd/ceph-$ID/keyring --name osd.$ID --add-key $OSD_SECRET; ceph-osd -i $ID --mkfs --osd-uuid $UUID; chown -R ceph:ceph /var/lib/ceph/osd/ceph-$ID; systemctl enable ceph-osd@$ID; systemctl start ceph-osd@$ID; done Once up we imported previous exports of empty head files in to 'real' OSDs: kvm5b: systemctl stop ceph-osd@8; ceph-objectstore-tool --op import --pgid 7.4s0 --data-path /var/lib/ceph/osd/ceph-8 --journal-path /var/lib/ceph/osd/ceph-8/journal --file /var/lib/vz/template/ssd_recovery/osd8_7.4s0.export; chown ceph:ceph -R /var/lib/ceph/osd/ceph-8; systemctl start ceph-osd@8; kvm5f: systemctl stop ceph-osd@23; ceph-objectstore-tool --op import --pgid 7.fs0 --data-path /var/lib/ceph/osd/ceph-23 --journal-path /var/lib/ceph/osd/ceph-23/journal --file /var/lib/vz/template/ssd_recovery/osd23_7.fs0.export; chown ceph:ceph -R /var/lib/ceph/osd/ceph-23; systemctl start ceph-osd@23; Bulk import previously exported objects: cd /var/lib/vz/template/ssd_recovery; for FILE in `ls -1A osd*_*.export | grep -Pv '^osd(8|23)_'`; do OSD=`echo $FILE | perl -pe 's/^osd(\d+).*/\1/'`; PGID=`echo $FILE | perl -pe &
Re: [ceph-users] Many concurrent drive failures - How do I activate pgs?
Hi Mike, I eventually got hold of a customer relations manager at Intel but his attitude was lack luster and Intel never officially responded to any correspondence we sent them. The Intel s4600 drives all passed our standard burn-in tests, they exclusively appear to fail once they handle production BlueStore usage, generally after a couple days use. Intel really didn’t seem interested, even after explaining that the drives were in different physical systems in different data centres and that I had been in contact with another Intel customer who had experienced similar failures in Dell equipment (our servers are pure Intel). Perhaps there’s interest in a Lawyer picking up the issue and their attitude. Not advising customers of a known issue which leads to data loss is simply negligent, especially on a product that they tout as being more reliable than spinners and has their Data Centre reliability stamp. I returned the lot and am done with Intel SSDs, will advise as many customers and peers to do the same… Regards David Herselman From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Mike Lovell Sent: Thursday, 22 February 2018 11:19 PM To: ceph-users@lists.ceph.com Subject: Re: [ceph-users] Many concurrent drive failures - How do I activate pgs? has anyone tried with the most recent firmwares from intel? i've had a number of s4600 960gb drives that have been waiting for me to get around to adding them to a ceph cluster. this as well as having 2 die almost simultaneously in a different storage box is giving me pause. i noticed that David listed some output showing his ssds were running firmware version SCV10100. the drives i have came with the same one. it looks like SCV10111 is available through the latest isdct package. i'm working through upgrading mine and attempting some burn in testing. just curious if anyone has had any luck there. mike On Thu, Feb 22, 2018 at 9:49 AM, Chris Sarginson mailto:csarg...@gmail.com>> wrote: Hi Caspar, Sean and I replaced the problematic DC S4600 disks (after all but one had failed) in our cluster with Samsung SM863a disks. There was an NDA for new Intel firmware (as mentioned earlier in the thread by David) but given the problems we were experiencing we moved all Intel disks to a single failure domain but were unable to get to deploy additional firmware to test. The Samsung should fit your requirements. http://www.samsung.com/semiconductor/minisite/ssd/product/enterprise/sm863a/ Regards Chris On Thu, 22 Feb 2018 at 12:50 Caspar Smit mailto:caspars...@supernas.eu>> wrote: Hi Sean and David, Do you have any follow ups / news on the Intel DC S4600 case? We are looking into this drives to use as DB/WAL devices for a new to be build cluster. Did Intel provide anything (like new firmware) which should fix the issues you were having or are these drives still unreliable? At the moment we are also looking into the Intel DC S3610 as an alternative which are a step back in performance but should be very reliable. Maybe any other recommendations for a ~200GB 2,5" SATA SSD to use as DB/WAL? (Aiming for ~3 DWPD should be sufficient for DB/WAL?) Kind regards, Caspar 2018-01-12 15:45 GMT+01:00 Sean Redmond mailto:sean.redmo...@gmail.com>>: Hi David, To follow up on this I had a 4th drive fail (out of 12) and have opted to order the below disks as a replacement, I have an ongoing case with Intel via the supplier - Will report back anything useful - But I am going to avoid the Intel s4600 2TB SSD's for the moment. 
1.92TB Samsung SM863a 2.5" Enterprise SSD, SATA3 6Gb/s, 2-bit MLC V-NAND Regards Sean Redmond On Wed, Jan 10, 2018 at 11:08 PM, Sean Redmond mailto:sean.redmo...@gmail.com>> wrote: Hi David, Thanks for your email, they are connected inside Dell R730XD (2.5 inch 24 disk model) in None RAID mode via a perc RAID card. The version of ceph is Jewel with kernel 4.13.X and ubuntu 16.04. Thanks for your feedback on the HGST disks. Thanks On Wed, Jan 10, 2018 at 10:55 PM, David Herselman mailto:d...@syrex.co>> wrote: Hi Sean, No, Intel’s feedback has been… Pathetic… I have yet to receive anything more than a request to ‘sign’ a non-disclosure agreement, to obtain beta firmware. No official answer as to whether or not one can logically unlock the drives, no answer to my question whether or not Intel publish serial numbers anywhere pertaining to recalled batches and no information pertaining to whether or not firmware updates would address any known issues. This with us being an accredited Intel Gold partner… We’ve returned the lot and ended up with 9/12 of the drives failing in the same manner. The replaced drives, which had different serial number ranges, also failed. Very frustrating is that the drives fail in a way that result in unbootable servers, unless one adds ‘rootdelay=240’ to the kernel. I would be interested to know wha
Re: [ceph-users] Ceph Nautilus - can't balance due to degraded state
s 8.338 [303,305] pg_upmap_items 8.339 [303,302] pg_upmap_items 8.33a [404,400] pg_upmap_items 8.33c [304,305] pg_upmap_items 8.344 [404,403] pg_upmap_items 8.346 [201,204] pg_upmap_items 8.349 [504,503] pg_upmap_items 8.350 [501,500] pg_upmap_items 8.356 [101,102] pg_upmap_items 8.358 [404,405] pg_upmap_items 8.363 [103,102] pg_upmap_items 8.364 [404,403] pg_upmap_items 8.366 [404,403] pg_upmap_items 8.369 [304,305] pg_upmap_items 8.36b [103,102] pg_upmap_items 8.373 [404,403] pg_upmap_items 8.383 [404,403] pg_upmap_items 8.39d [203,205] pg_upmap_items 8.3a3 [103,102] pg_upmap_items 8.3a6 [304,305] pg_upmap_items 8.3ab [304,305] pg_upmap_items 8.3af [304,305] pg_upmap_items 8.3b3 [404,405] pg_upmap_items 8.3b4 [303,305] pg_upmap_items 8.3b7 [404,403] pg_upmap_items 8.3b9 [404,403] pg_upmap_items 8.3ba [404,403,201,205] pg_upmap_items 8.3bd [404,405] pg_upmap_items 8.3c0 [304,305] pg_upmap_items 8.3c3 [404,403] pg_upmap_items 8.3ca [404,403] pg_upmap_items 8.3cf [404,405] pg_upmap_items 8.3d0 [404,405] pg_upmap_items 8.3da [404,403] pg_upmap_items 8.3e4 [404,405] pg_upmap_items 8.3ea [404,405] pg_upmap_items 8.3ec [203,205] pg_upmap_items 8.3f3 [501,505] pg_upmap_items 8.3f7 [304,305] pg_upmap_items 8.3fb [404,405] pg_upmap_items 8.3fc [304,305] pg_upmap_items 8.400 [105,102,404,403] pg_upmap_items 8.409 [404,403] pg_upmap_items 8.40b [103,102,404,405] pg_upmap_items 8.40c [404,400] pg_upmap_items 8.410 [404,403] pg_upmap_items 8.411 [404,405] pg_upmap_items 8.417 [404,403] pg_upmap_items 8.418 [404,403] pg_upmap_items 9.2 [10401,10400] pg_upmap_items 9.9 [10200,10201] Regards David Herselman ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
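For anyone comparing notes, a rough sketch of how the balancer state and the pg_upmap_items listed above can be inspected and, if necessary, removed per PG; the PG ID is an example taken from the listing above:

# Balancer state and the upmap exceptions currently in the OSD map:
ceph balancer status
ceph osd dump | grep pg_upmap_items

# Pause the balancer and drop a single upmap entry if it pins a PG to the wrong OSDs:
ceph balancer off
ceph osd rm-pg-upmap-items 8.339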
[ceph-users] Problem formatting erasure coded image
Hi, I'm seeing errors in Windows VM guests's event logs, for example: The IO operation at logical block address 0x607bf7 for Disk 1 (PDO name \Device\001e) was retried Log Name: System Source: Disk Event ID: 153 Level: Warning Initialising the disk to use GPT is successful but attempting to create a standard NTFS volume eventually times out and fails. Pretty sure this is in production in numerous environments, so I must be doing something wrong... Could anyone please validate that a rbd cached erasure coded image can be used as a Windows VM data disc? Running Ceph Nautilus 14.2.4 with kernel 5.0.21 Created new erasure coded pool backed by spinners and a new replicated ssd pool for metadata: ceph osd erasure-code-profile set ec32_hdd \ plugin=jerasure k=3 m=2 technique=reed_sol_van \ crush-root=default crush-failure-domain=host crush-device-class=hdd \ directory=/usr/lib/ceph/erasure-code; ceph osd pool create ec_hdd 64 erasure ec32_hdd; ceph osd pool set ec_hdd allow_ec_overwrites true; ceph osd pool application enable ec_hdd rbd; ceph osd crush rule create-replicated replicated_ssd default host ssd; ceph osd pool create rbd_ssd 64 64 replicated replicated_ssd; ceph osd pool application enable rbd_ssd rbd; rbd create rbd_ssd/surveylance-recordings --size 1T --data-pool ec_hdd; Added a caching tier: ceph osd pool create ec_hdd_cache 64 64 replicated replicated_ssd; ceph osd tier add ec_hdd ec_hdd_cache; ceph osd tier cache-mode ec_hdd_cache writeback; ceph osd tier set-overlay ec_hdd ec_hdd_cache; ceph osd pool set ec_hdd_cache hit_set_type bloom; ceph osd pool set ec_hdd_cache hit_set_count 12 ceph osd pool set ec_hdd_cache hit_set_period 14400 ceph osd pool set ec_hdd_cache target_max_bytes $[128*1024*1024*1024] ceph osd pool set ec_hdd_cache min_read_recency_for_promote 2 ceph osd pool set ec_hdd_cache min_write_recency_for_promote 2 ceph osd pool set ec_hdd_cache cache_target_dirty_ratio 0.4 ceph osd pool set ec_hdd_cache cache_target_dirty_high_ratio 0.6 ceph osd pool set ec_hdd_cache cache_target_full_ratio 0.8 Image appears to have been created correctly: rbd ls rbd_ssd -l NAME SIZE PARENT FMT PROT LOCK surveylance-recordings 1 TiB 2 rbd info rbd_ssd/surveylance-recordings rbd image 'surveylance-recordings': size 1 TiB in 262144 objects order 22 (4 MiB objects) snapshot_count: 0 id: 7341cc54df71f data_pool: ec_hdd block_name_prefix: rbd_data.2.7341cc54df71f format: 2 features: layering, data-pool op_features: flags: create_timestamp: Sun Sep 22 17:47:30 2019 access_timestamp: Sun Sep 22 17:47:30 2019 modify_timestamp: Sun Sep 22 17:47:30 2019 Ceph appears healthy: ceph -s cluster: id: 31f6ea46-12cb-47e8-a6f3-60fb6bbd1782 health: HEALTH_OK services: mon: 3 daemons, quorum kvm1a,kvm1b,kvm1c (age 5d) mgr: kvm1c(active, since 5d), standbys: kvm1b, kvm1a mds: cephfs:1 {0=kvm1c=up:active} 2 up:standby osd: 24 osds: 24 up (since 4d), 24 in (since 4d) data: pools: 9 pools, 417 pgs objects: 325.04k objects, 1.1 TiB usage: 3.3 TiB used, 61 TiB / 64 TiB avail pgs: 417 active+clean io: client: 25 KiB/s rd, 2.7 MiB/s wr, 17 op/s rd, 306 op/s wr cache:0 op/s promote ceph df RAW STORAGE: CLASS SIZEAVAIL USEDRAW USED %RAW USED hdd62 TiB 59 TiB 2.9 TiB 2.9 TiB 4.78 ssd 2.4 TiB 2.1 TiB 303 GiB 309 GiB 12.36 TOTAL 64 TiB 61 TiB 3.2 TiB 3.3 TiB 5.07 POOLS: POOL ID STORED OBJECTS USED%USED MAX AVAIL rbd_hdd1 995 GiB 289.54k 2.9 TiB 5.23 18 TiB rbd_ssd217 B 4 48 KiB 0 666 GiB rbd_hdd_cache 3 99 GiB 34.91k 302 GiB 13.13 666 GiB cephfs_data4 2.1 GiB 526 6.4 GiB 0.01 18 TiB cephfs_metadata5 767 KiB 22 3.7 
MiB 0 18 TiB device_health_metrics 6 5.9 MiB 24 5.9 MiB 0 18 TiB ec_hdd10 4.0 MiB 3 7.5 MiB 0 32 TiB ec_hdd_cache 11 67 MiB 30 200 MiB 0 666 GiB Regards David Herselman ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
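In case it helps with reproducing the problem, one way to take Windows and the hypervisor out of the picture is to exercise the image directly from a Ceph client node. This is a sketch only, and note that both tests below write into the image (the mkfs is destructive), so only run them on a scratch image:

# Write into the EC-backed image and confirm the data lands in the expected pools:
rbd bench --io-type write --io-size 4M --io-total 1G rbd_ssd/surveylance-recordings
rados df | grep -e ec_hdd -e ec_hdd_cache

# Or map it with krbd and run a quick filesystem test from Linux
# (rbd map prints the device node, typically /dev/rbd0):
rbd map rbd_ssd/surveylance-recordings
mkfs.ext4 /dev/rbd0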
[ceph-users] Ceph assimilated configuration - unable to remove item
Hi,

We assimilated our Ceph configuration to store attributes within Ceph itself and subsequently have a minimal configuration file. Whilst this works perfectly we are unable to remove configuration entries populated by the assimilate-conf command.

Ceph Nautilus 14.2.4.1 upgrade notes:
cd /etc/pve;
ceph config assimilate-conf -i ceph.conf -o ceph.conf.new;
mv ceph.conf.new ceph.conf;
pico /etc/ceph/ceph.conf    # add back: cluster_network # public_network
ceph config rm global cluster_network;
ceph config rm global public_network;
ceph config set global mon_osd_down_out_subtree_limit host;

Resulting minimal Ceph configuration file:
[admin@kvm1c ~]# cat /etc/ceph/ceph.conf
[global]
cluster_network = 10.248.1.0/24
filestore_xattr_use_omap = true
fsid = 31f6ea46-12cb-47e8-a6f3-60fb6bbd1782
mon_host = 10.248.1.60 10.248.1.61 10.248.1.62
public_network = 10.248.1.0/24

[client]
keyring = /etc/pve/priv/$cluster.$name.keyring

Ceph configuration entries:
[admin@kvm1c ~]# ceph config dump
WHO    MASK  LEVEL     OPTION                               VALUE          RO
global       advanced  auth_client_required                 cephx          *
global       advanced  auth_cluster_required                cephx          *
global       advanced  auth_service_required                cephx          *
global       advanced  cluster_network                      10.248.1.0/24  *
global       advanced  debug_filestore                      0/0
global       advanced  debug_journal                        0/0
global       advanced  debug_ms                             0/0
global       advanced  debug_osd                            0/0
global       basic     device_failure_prediction_mode       cloud
global       advanced  mon_allow_pool_delete                true
global       advanced  mon_osd_down_out_subtree_limit       host
global       advanced  osd_deep_scrub_interval              1209600.00
global       advanced  osd_pool_default_min_size            2
global       advanced  osd_pool_default_size                3
global       advanced  osd_scrub_begin_hour                 19
global       advanced  osd_scrub_end_hour                   6
global       advanced  osd_scrub_sleep                      0.10
global       advanced  public_network                       10.248.1.0/24  *
global       advanced  rbd_default_features                 7
global       advanced  rbd_default_features                 31
mgr          advanced  mgr/balancer/active                  true
mgr          advanced  mgr/balancer/mode                    upmap
mgr          advanced  mgr/devicehealth/enable_monitoring   true

Note the duplicate 'rbd_default_features' entry. We've switched to kernel 5.3 which supports object-map and fast-diff and subsequently wanted to change the default features for new RBD images to reflect this. Commands we entered to get here:

[admin@kvm1b ~]# ceph config dump | grep -e WHO -e rbd_default_features
WHO    MASK  LEVEL     OPTION                VALUE  RO
global       advanced  rbd_default_features  7
[admin@kvm1b ~]# ceph config rm global rbd_default_features
[admin@kvm1b ~]# ceph config rm global rbd_default_features
[admin@kvm1b ~]# ceph config rm global rbd_default_features
[admin@kvm1b ~]# ceph config dump | grep -e WHO -e rbd_default_features
WHO    MASK  LEVEL     OPTION                VALUE  RO
global       advanced  rbd_default_features  7
[admin@kvm1b ~]# ceph config set global rbd_default_features 31
[admin@kvm1b ~]# ceph config dump | grep -e WHO -e rbd_default_features
WHO    MASK  LEVEL     OPTION                VALUE  RO
global       advanced  rbd_default_features  7
global       advanced  rbd_default_features  31

Regards
David Herselman

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
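One way to see what assimilate-conf actually stored, and to spot the duplicated option, is to dump the underlying config-key store directly; a quick sketch:

# The mon configuration database is held in config-key under the config/ prefix:
ceph config-key dump | grep rbd_default_features

# Entries appear as config/<who>/<option>, e.g. "config/global/rbd_default_features"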
Re: [ceph-users] Ceph assimilated configuration - unable to remove item
Hi,

I've logged a bug report (https://tracker.ceph.com/issues/43296?next_issue_id=43295&prev_issue_id=43297) and Alwin from Proxmox was kind enough to provide a workaround:

ceph config rm global rbd_default_features;
ceph config-key rm config/global/rbd_default_features;
ceph config set global rbd_default_features 31;

ceph config dump | grep -e WHO -e rbd_default_features;
WHO    MASK  LEVEL     OPTION                VALUE  RO
global       advanced  rbd_default_features  31

Regards
David Herselman

-Original Message-
From: Stefan Kooman
Sent: Wednesday, 11 December 2019 3:05 PM
To: David Herselman
Cc: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Ceph assimilated configuration - unable to remove item

Quoting David Herselman (d...@syrex.co):
> Hi,
>
> We assimilated our Ceph configuration to store attributes within Ceph
> itself and subsequently have a minimal configuration file. Whilst this
> works perfectly we are unable to remove configuration entries
> populated by the assimilate-conf command.

I forgot about this issue, but I encountered this when we upgraded to mimic. I can confirm this bug. It's possible to have the same key present with different values.

For our production cluster we decided to stick to ceph.conf for the time being. That's also the workaround for now if you want to override the config store: just put that in your config file and reboot the daemon(s).

Gr. Stefan

--
| BIT BV https://www.bit.nl/ Kamer van Koophandel 09090351
| GPG: 0xD14839C6 +31 318 648 688 / i...@bit.nl

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com