Re: [ceph-users] Ceph assimilated configuration - unable to remove item

2019-12-13 Thread David Herselman
Hi,

I've logged a bug report (https://tracker.ceph.com/issues/43296) and Alwin from 
Proxmox was kind enough to provide a workaround:
ceph config rm global rbd_default_features;
ceph config-key rm config/global/rbd_default_features;
ceph config set global rbd_default_features 31;

ceph config dump | grep -e WHO -e rbd_default_features;
WHO    MASK LEVEL    OPTION                VALUE RO
global      advanced rbd_default_features  31
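To double-check that the stale assimilated key really is gone from the monitors'
key/value store as well, something along these lines should work (a quick sanity
check, assuming the config/global/... key naming used in the workaround above):

ceph config-key ls | grep rbd_default_features;    # should no longer list a stale config/global/... key
ceph config dump | grep rbd_default_features;      # should now show only the single new value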


Regards
David Herselman

-Original Message-
From: Stefan Kooman  
Sent: Wednesday, 11 December 2019 3:05 PM
To: David Herselman 
Cc: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Ceph assimilated configuration - unable to remove item

Quoting David Herselman (d...@syrex.co):
> Hi,
> 
> We assimilated our Ceph configuration to store attributes within Ceph 
> itself and subsequently have a minimal configuration file. Whilst this 
> works perfectly, we are unable to remove configuration entries 
> populated by the assimilate-conf command.

I forgot about this issue, but I encountered it when we upgraded to Mimic. I can 
confirm this bug. It's possible to have the same key present with different 
values. For our production cluster we decided to stick to ceph.conf for the 
time being. That's also the workaround for now if you want to override the 
config store: just put the setting in your config file and restart the daemon(s).
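A minimal sketch of that override, purely illustrative (a value set in the local
file wins over the assimilated config store value for daemons on that host):

# /etc/ceph/ceph.conf  (illustrative)
[global]
    rbd_default_features = 31

# then restart whichever daemons should pick it up, e.g.:
# systemctl restart ceph-osd@<id>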

Gr. Stefan


-- 
| BIT BV  https://www.bit.nl/Kamer van Koophandel 09090351
| GPG: 0xD14839C6   +31 318 648 688 / i...@bit.nl
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Ceph assimilated configuration - unable to remove item

2019-12-11 Thread David Herselman
Hi,

We assimilated our Ceph configuration to store attributes within Ceph itself 
and subsequently have a minimal configuration file. Whilst this works perfectly, 
we are unable to remove configuration entries populated by the assimilate-conf 
command.

Ceph Nautilus 14.2.4.1 upgrade notes:
cd /etc/pve;
ceph config assimilate-conf -i ceph.conf -o ceph.conf.new;
mv ceph.conf.new ceph.conf;
pico /etc/ceph/ceph.conf
  # add back: cluster_network
  #   public_network
ceph config rm global cluster_network;
ceph config rm global public_network;
ceph config set global mon_osd_down_out_subtree_limit host;

Resulting minimal Ceph configuration file:
[admin@kvm1c ~]# cat /etc/ceph/ceph.conf
[global]
 cluster_network = 10.248.1.0/24
 filestore_xattr_use_omap = true
 fsid = 31f6ea46-12cb-47e8-a6f3-60fb6bbd1782
 mon_host = 10.248.1.60 10.248.1.61 10.248.1.62
 public_network = 10.248.1.0/24

[client]
 keyring = /etc/pve/priv/$cluster.$name.keyring

Ceph configuration entries:
[admin@kvm1c ~]# ceph config dump
WHO    MASK LEVEL    OPTION                              VALUE          RO
global      advanced auth_client_required                cephx          *
global      advanced auth_cluster_required               cephx          *
global      advanced auth_service_required               cephx          *
global      advanced cluster_network                     10.248.1.0/24  *
global      advanced debug_filestore                     0/0
global      advanced debug_journal                       0/0
global      advanced debug_ms                            0/0
global      advanced debug_osd                           0/0
global      basic    device_failure_prediction_mode      cloud
global      advanced mon_allow_pool_delete               true
global      advanced mon_osd_down_out_subtree_limit      host
global      advanced osd_deep_scrub_interval             1209600.00
global      advanced osd_pool_default_min_size           2
global      advanced osd_pool_default_size               3
global      advanced osd_scrub_begin_hour                19
global      advanced osd_scrub_end_hour                  6
global      advanced osd_scrub_sleep                     0.10
global      advanced public_network                      10.248.1.0/24  *
global      advanced rbd_default_features                7
global      advanced rbd_default_features                31
  mgr       advanced mgr/balancer/active                 true
  mgr       advanced mgr/balancer/mode                   upmap
  mgr       advanced mgr/devicehealth/enable_monitoring  true

Note the duplicate 'rbd_default_features' entry. We've switched to kernel 5.3, 
which supports object-map and fast-diff, and subsequently wanted to change the 
default features for new RBD images to reflect this.
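For reference, 7 and 31 are just sums of the individual RBD feature bits; a quick
sketch of the arithmetic and an equivalent named form (assuming the option accepts
feature names here, as it does in ceph.conf):

# RBD feature bits: layering=1, striping=2, exclusive-lock=4, object-map=8,
# fast-diff=16, deep-flatten=32, journaling=64, data-pool=128
#   7 = 1+2+4        (layering, striping, exclusive-lock)
#  31 = 1+2+4+8+16   (adds object-map and fast-diff)
echo $[1+2+4+8+16];   # 31
ceph config set global rbd_default_features 'layering,striping,exclusive-lock,object-map,fast-diff';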

Commands we entered to get here:
[admin@kvm1b ~]# ceph config dump | grep -e WHO -e rbd_default_features
WHO    MASK LEVEL    OPTION                VALUE RO
global      advanced rbd_default_features  7

[admin@kvm1b ~]# ceph config rm global rbd_default_features
[admin@kvm1b ~]# ceph config rm global rbd_default_features
[admin@kvm1b ~]# ceph config rm global rbd_default_features

[admin@kvm1b ~]# ceph config dump | grep -e WHO -e rbd_default_features
WHO    MASK LEVEL    OPTION                VALUE RO
global      advanced rbd_default_features  7

[admin@kvm1b ~]# ceph config set global rbd_default_features 31
[admin@kvm1b ~]# ceph config dump | grep -e WHO -e rbd_default_features
WHO    MASK LEVEL    OPTION                VALUE RO
global      advanced rbd_default_features  7
global      advanced rbd_default_features  31



Regards
David Herselman


[ceph-users] Problem formatting erasure coded image

2019-09-22 Thread David Herselman
Hi,

I'm seeing errors in Windows VM guests' event logs, for example:
The IO operation at logical block address 0x607bf7 for Disk 1 (PDO name 
\Device\001e) was retried
Log Name: System
Source: Disk
Event ID: 153
Level: Warning

Initialising the disk to use GPT is successful but attempting to create a 
standard NTFS volume eventually times out and fails.


Pretty sure this is in production in numerous environments, so I must be doing 
something wrong... Could anyone please validate that an RBD-cached erasure coded 
image can be used as a Windows VM data disk?


Running Ceph Nautilus 14.2.4 with kernel 5.0.21

Created a new erasure coded pool backed by spinners and a new replicated SSD pool 
for metadata:
ceph osd erasure-code-profile set ec32_hdd \
  plugin=jerasure k=3 m=2 technique=reed_sol_van \
  crush-root=default crush-failure-domain=host crush-device-class=hdd \
  directory=/usr/lib/ceph/erasure-code;
ceph osd pool create ec_hdd 64 erasure ec32_hdd;
ceph osd pool set ec_hdd allow_ec_overwrites true;
ceph osd pool application enable ec_hdd rbd;

ceph osd crush rule create-replicated replicated_ssd default host ssd;
ceph osd pool create rbd_ssd 64 64 replicated replicated_ssd;
ceph osd pool application enable rbd_ssd rbd;

rbd create rbd_ssd/surveylance-recordings --size 1T --data-pool ec_hdd;

Added a caching tier:
ceph osd pool create ec_hdd_cache 64 64 replicated replicated_ssd;
ceph osd tier add ec_hdd ec_hdd_cache;
ceph osd tier cache-mode ec_hdd_cache writeback;
ceph osd tier set-overlay ec_hdd ec_hdd_cache;
ceph osd pool set ec_hdd_cache hit_set_type bloom;

ceph osd pool set ec_hdd_cache hit_set_count 12
ceph osd pool set ec_hdd_cache hit_set_period 14400
ceph osd pool set ec_hdd_cache target_max_bytes $[128*1024*1024*1024]
ceph osd pool set ec_hdd_cache min_read_recency_for_promote 2
ceph osd pool set ec_hdd_cache min_write_recency_for_promote 2
ceph osd pool set ec_hdd_cache cache_target_dirty_ratio 0.4
ceph osd pool set ec_hdd_cache cache_target_dirty_high_ratio 0.6
ceph osd pool set ec_hdd_cache cache_target_full_ratio 0.8
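To read the tiering configuration back afterwards (a sketch; 'ceph osd pool get'
accepts the variables set above):

ceph osd dump | grep ec_hdd;   # shows the tier overlay and cache_mode on the pools
for VAR in hit_set_type hit_set_count hit_set_period target_max_bytes \
           cache_target_dirty_ratio cache_target_dirty_high_ratio cache_target_full_ratio; do
  ceph osd pool get ec_hdd_cache $VAR;
done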


Image appears to have been created correctly:
rbd ls rbd_ssd -l
NAME                    SIZE   PARENT  FMT  PROT  LOCK
surveylance-recordings  1 TiB          2

rbd info rbd_ssd/surveylance-recordings
rbd image 'surveylance-recordings':
size 1 TiB in 262144 objects
order 22 (4 MiB objects)
snapshot_count: 0
id: 7341cc54df71f
data_pool: ec_hdd
block_name_prefix: rbd_data.2.7341cc54df71f
format: 2
features: layering, data-pool
op_features:
flags:
create_timestamp: Sun Sep 22 17:47:30 2019
access_timestamp: Sun Sep 22 17:47:30 2019
modify_timestamp: Sun Sep 22 17:47:30 2019

Ceph appears healthy:
ceph -s
  cluster:
id: 31f6ea46-12cb-47e8-a6f3-60fb6bbd1782
health: HEALTH_OK

  services:
mon: 3 daemons, quorum kvm1a,kvm1b,kvm1c (age 5d)
mgr: kvm1c(active, since 5d), standbys: kvm1b, kvm1a
mds: cephfs:1 {0=kvm1c=up:active} 2 up:standby
osd: 24 osds: 24 up (since 4d), 24 in (since 4d)

  data:
pools:   9 pools, 417 pgs
objects: 325.04k objects, 1.1 TiB
usage:   3.3 TiB used, 61 TiB / 64 TiB avail
pgs: 417 active+clean

  io:
client:   25 KiB/s rd, 2.7 MiB/s wr, 17 op/s rd, 306 op/s wr
cache:0 op/s promote

ceph df
  RAW STORAGE:
CLASS SIZEAVAIL   USEDRAW USED %RAW USED
hdd62 TiB  59 TiB 2.9 TiB  2.9 TiB  4.78
ssd   2.4 TiB 2.1 TiB 303 GiB  309 GiB 12.36
TOTAL  64 TiB  61 TiB 3.2 TiB  3.3 TiB  5.07

  POOLS:
POOL                   ID  STORED   OBJECTS  USED     %USED  MAX AVAIL
rbd_hdd                 1  995 GiB  289.54k  2.9 TiB   5.23     18 TiB
rbd_ssd                 2     17 B        4   48 KiB      0    666 GiB
rbd_hdd_cache           3   99 GiB   34.91k  302 GiB  13.13    666 GiB
cephfs_data             4  2.1 GiB      526  6.4 GiB   0.01     18 TiB
cephfs_metadata         5  767 KiB       22  3.7 MiB      0     18 TiB
device_health_metrics   6  5.9 MiB       24  5.9 MiB      0     18 TiB
ec_hdd                 10  4.0 MiB        3  7.5 MiB      0     32 TiB
ec_hdd_cache           11   67 MiB       30  200 MiB      0    666 GiB



Regards
David Herselman



Re: [ceph-users] Ceph Nautilus - can't balance due to degraded state

2019-08-03 Thread David Herselman
pg_upmap_items 8.33c [304,305]
pg_upmap_items 8.344 [404,403]
pg_upmap_items 8.346 [201,204]
pg_upmap_items 8.349 [504,503]
pg_upmap_items 8.350 [501,500]
pg_upmap_items 8.356 [101,102]
pg_upmap_items 8.358 [404,405]
pg_upmap_items 8.363 [103,102]
pg_upmap_items 8.364 [404,403]
pg_upmap_items 8.366 [404,403]
pg_upmap_items 8.369 [304,305]
pg_upmap_items 8.36b [103,102]
pg_upmap_items 8.373 [404,403]
pg_upmap_items 8.383 [404,403]
pg_upmap_items 8.39d [203,205]
pg_upmap_items 8.3a3 [103,102]
pg_upmap_items 8.3a6 [304,305]
pg_upmap_items 8.3ab [304,305]
pg_upmap_items 8.3af [304,305]
pg_upmap_items 8.3b3 [404,405]
pg_upmap_items 8.3b4 [303,305]
pg_upmap_items 8.3b7 [404,403]
pg_upmap_items 8.3b9 [404,403]
pg_upmap_items 8.3ba [404,403,201,205]
pg_upmap_items 8.3bd [404,405]
pg_upmap_items 8.3c0 [304,305]
pg_upmap_items 8.3c3 [404,403]
pg_upmap_items 8.3ca [404,403]
pg_upmap_items 8.3cf [404,405]
pg_upmap_items 8.3d0 [404,405]
pg_upmap_items 8.3da [404,403]
pg_upmap_items 8.3e4 [404,405]
pg_upmap_items 8.3ea [404,405]
pg_upmap_items 8.3ec [203,205]
pg_upmap_items 8.3f3 [501,505]
pg_upmap_items 8.3f7 [304,305]
pg_upmap_items 8.3fb [404,405]
pg_upmap_items 8.3fc [304,305]
pg_upmap_items 8.400 [105,102,404,403]
pg_upmap_items 8.409 [404,403]
pg_upmap_items 8.40b [103,102,404,405]
pg_upmap_items 8.40c [404,400]
pg_upmap_items 8.410 [404,403]
pg_upmap_items 8.411 [404,405]
pg_upmap_items 8.417 [404,403]
pg_upmap_items 8.418 [404,403]
pg_upmap_items 9.2 [10401,10400]
pg_upmap_items 9.9 [10200,10201]
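For context, the list above is the tail of the pg_upmap_items section of
'ceph osd dump'. A sketch of how such entries can be inspected and, if need be,
removed individually (pg ID purely illustrative):

ceph osd dump | grep pg_upmap_items;
ceph osd rm-pg-upmap-items 8.33c;   # drops the exception so the PG maps back to its CRUSH placement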


Regards
David Herselman


Re: [ceph-users] Many concurrent drive failures - How do I activate pgs?

2018-02-22 Thread David Herselman
Hi Mike,

I eventually got hold of a customer relations manager at Intel, but his attitude 
was lackluster and Intel never officially responded to any correspondence we 
sent them. The Intel S4600 drives all passed our standard burn-in tests; they 
appear to fail only once they handle production BlueStore usage, generally after 
a couple of days' use.

Intel really didn’t seem interested, even after explaining that the drives were 
in different physical systems in different data centres and that I had been in 
contact with another Intel customer who had experienced similar failures in 
Dell equipment (our servers are pure Intel).


Perhaps there's interest from a lawyer in picking up the issue and their attitude. 
Not advising customers of a known issue which leads to data loss is simply 
negligent, especially on a product that they tout as being more reliable than 
spinners and that carries their Data Centre reliability stamp.

I returned the lot and am done with Intel SSDs, and will advise as many customers 
and peers as possible to do the same…


Regards
David Herselman


From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Mike 
Lovell
Sent: Thursday, 22 February 2018 11:19 PM
To: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Many concurrent drive failures - How do I activate 
pgs?

has anyone tried with the most recent firmwares from intel? i've had a number 
of s4600 960gb drives that have been waiting for me to get around to adding 
them to a ceph cluster. this as well as having 2 die almost simultaneously in a 
different storage box is giving me pause. i noticed that David listed some 
output showing his ssds were running firmware version SCV10100. the drives i 
have came with the same one. it looks like SCV10111 is available through the 
latest isdct package. i'm working through upgrading mine and attempting some 
burn in testing. just curious if anyone has had any luck there.
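for anyone wanting to check the same thing, the isdct invocations are roughly as
follows (treat this as a sketch; the syntax may differ between isdct releases and
the drive index is system specific):

isdct show -intelssd        # list detected intel ssds and their current firmware
isdct load -intelssd 0      # stage the bundled firmware onto drive index 0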

mike

On Thu, Feb 22, 2018 at 9:49 AM, Chris Sarginson 
<csarg...@gmail.com<mailto:csarg...@gmail.com>> wrote:
Hi Caspar,

Sean and I replaced the problematic DC S4600 disks (after all but one had 
failed) in our cluster with Samsung SM863a disks.
There was an NDA for new Intel firmware (as mentioned earlier in the thread by 
David), but given the problems we were experiencing we moved all Intel disks to 
a single failure domain and were unable to deploy the additional firmware to 
test.

The Samsung should fit your requirements.

http://www.samsung.com/semiconductor/minisite/ssd/product/enterprise/sm863a/

Regards
Chris

On Thu, 22 Feb 2018 at 12:50 Caspar Smit 
<caspars...@supernas.eu<mailto:caspars...@supernas.eu>> wrote:
Hi Sean and David,

Do you have any follow-ups / news on the Intel DC S4600 case? We are looking 
into these drives to use as DB/WAL devices for a new cluster we are about to build.

Did Intel provide anything (like new firmware) which should fix the issues you 
were having or are these drives still unreliable?

At the moment we are also looking into the Intel DC S3610 as an alternative, 
which is a step back in performance but should be very reliable.

Are there any other recommendations for a ~200GB 2.5" SATA SSD to use as DB/WAL? 
(Aiming for ~3 DWPD, which should be sufficient for DB/WAL?)
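As a rough sanity check on that endurance figure (simple arithmetic, assuming a
5 year warranty period):

# 200 GB drive at 3 DWPD over 5 years:
echo $[200*3*365*5];   # ~1095000 GB, i.e. roughly 1.1 PB of total writes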

Kind regards,
Caspar

2018-01-12 15:45 GMT+01:00 Sean Redmond 
<sean.redmo...@gmail.com<mailto:sean.redmo...@gmail.com>>:
Hi David,

To follow up on this, I had a 4th drive fail (out of 12) and have opted to order 
the below disks as a replacement. I have an ongoing case with Intel via the 
supplier and will report back anything useful, but I am going to avoid the Intel 
S4600 2TB SSDs for the moment.

1.92TB Samsung SM863a 2.5" Enterprise SSD, SATA3 6Gb/s, 2-bit MLC V-NAND

Regards
Sean Redmond

On Wed, Jan 10, 2018 at 11:08 PM, Sean Redmond 
<sean.redmo...@gmail.com<mailto:sean.redmo...@gmail.com>> wrote:
Hi David,

Thanks for your email. They are connected inside a Dell R730XD (2.5 inch, 24 disk 
model) in non-RAID mode via a PERC RAID card.

The version of Ceph is Jewel, with kernel 4.13.x and Ubuntu 16.04.

Thanks for your feedback on the HGST disks.

Thanks

On Wed, Jan 10, 2018 at 10:55 PM, David Herselman 
<d...@syrex.co<mailto:d...@syrex.co>> wrote:
Hi Sean,

No, Intel’s feedback has been… Pathetic… I have yet to receive anything more 
than a request to ‘sign’ a non-disclosure agreement, to obtain beta firmware. 
No official answer as to whether or not one can logically unlock the drives, no 
answer to my question whether or not Intel publish serial numbers anywhere 
pertaining to recalled batches and no information pertaining to whether or not 
firmware updates would address any known issues.

This with us being an accredited Intel Gold partner…


We’ve returned the lot and ended up with 9/12 of the drives failing in the same 
manner. The replaced drives, which had different serial number ranges, also 
failed. Very frustrating is that the drives fail in a way that results in 
unbootable servers, unless one adds ‘rootdelay=240’ to the kernel.

Re: [ceph-users] Many concurrent drive failures - How do I activate pgs?

2018-01-10 Thread David Herselman
Hi Sean,

No, Intel’s feedback has been… Pathetic… I have yet to receive anything more 
than a request to ‘sign’ a non-disclosure agreement, to obtain beta firmware. 
No official answer as to whether or not one can logically unlock the drives, no 
answer to my question whether or not Intel publish serial numbers anywhere 
pertaining to recalled batches and no information pertaining to whether or not 
firmware updates would address any known issues.

This with us being an accredited Intel Gold partner…


We’ve returned the lot and ended up with 9/12 of the drives failing in the same 
manner. The replaced drives, which had different serial number ranges, also 
failed. Very frustrating is that the drives fail in a way that results in 
unbootable servers, unless one adds ‘rootdelay=240’ to the kernel.
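For anyone hitting the same boot hangs, the parameter goes onto the kernel command 
line in the usual way; a sketch for a GRUB-based Debian/Proxmox host:

# append rootdelay=240 to the default kernel command line and regenerate the config:
sed -i 's/^GRUB_CMDLINE_LINUX_DEFAULT="/&rootdelay=240 /' /etc/default/grub;
update-grub;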


I would be interested to know what platform your drives were in and whether or 
not they were connected to a RAID module/card.

PS: After much searching we’ve decided to order the NVMe conversion kit and 
have ordered HGST UltraStar SN200 2.5 inch SFF drives with a 3 DWPD rating.


Regards
David Herselman

From: Sean Redmond [mailto:sean.redmo...@gmail.com]
Sent: Thursday, 11 January 2018 12:45 AM
To: David Herselman <d...@syrex.co>
Cc: Christian Balzer <ch...@gol.com>; ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Many concurrent drive failures - How do I activate 
pgs?

Hi,

I have a case where 3 out of 12 of these Intel S4600 2TB models failed within a 
matter of days after being burn-in tested and then placed into production.

I am interested to know: did you ever get any further feedback from the vendor 
on your issue?

Thanks

On Thu, Dec 21, 2017 at 1:38 PM, David Herselman 
<d...@syrex.co<mailto:d...@syrex.co>> wrote:
Hi,

I assume this can only be a physical manufacturing flaw or a firmware bug? Do 
Intel publish advisories on recalled equipment? Should others be concerned 
about using Intel DC S4600 SSD drives? Could this be an electrical issue on the 
Hot Swap Backplane or a BMC firmware issue? Either way, all pure Intel...

The hole is only 1.3 GB (4 MB x 339 objects) but is perfectly striped through the 
images; file systems are subsequently severely damaged.

Is it possible to get Ceph to read in partial data shards? It would provide 
between 25% and 75% more yield...


Is there anything wrong with how we've proceeded thus far? It would be nice to 
reference examples of using ceph-objectstore-tool, but documentation is 
virtually non-existent.

We used another SSD drive to simulate bringing all the SSDs back online. We 
carved up the drive to provide equal partitions to essentially simulate the 
original SSDs:
  # Partition a drive to provide 12 x 140GB partitions, eg:
sdd   8:48   0   1.8T  0 disk
|-sdd18:49   0   140G  0 part
|-sdd28:50   0   140G  0 part
|-sdd38:51   0   140G  0 part
|-sdd48:52   0   140G  0 part
|-sdd58:53   0   140G  0 part
|-sdd68:54   0   140G  0 part
|-sdd78:55   0   140G  0 part
|-sdd88:56   0   140G  0 part
|-sdd98:57   0   140G  0 part
|-sdd10   8:58   0   140G  0 part
|-sdd11   8:59   0   140G  0 part
+-sdd12   8:60   0   140G  0 part
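For completeness, one way to carve up such a stand-in drive (a sketch using 
sgdisk; we are not claiming this exact invocation, any tool that yields twelve 
equal partitions will do):

for PART in `seq 1 12`; do
  sgdisk --new=$PART:0:+140G /dev/sdd;
done
partprobe /dev/sdd;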


  Pre-requisites:
ceph osd set noout;
apt-get install uuid-runtime;


  for ID in `seq 24 35`; do
    UUID=`uuidgen`;
    OSD_SECRET=`ceph-authtool --gen-print-key`;
    DEVICE='/dev/sdd'$[$ID-23];   # 24-23 = /dev/sdd1, 35-23 = /dev/sdd12
    echo "{\"cephx_secret\": \"$OSD_SECRET\"}" | ceph osd new $UUID $ID -i - -n client.bootstrap-osd -k /var/lib/ceph/bootstrap-osd/ceph.keyring;
    mkdir /var/lib/ceph/osd/ceph-$ID;
    mkfs.xfs $DEVICE;
    mount $DEVICE /var/lib/ceph/osd/ceph-$ID;
    ceph-authtool --create-keyring /var/lib/ceph/osd/ceph-$ID/keyring --name osd.$ID --add-key $OSD_SECRET;
    ceph-osd -i $ID --mkfs --osd-uuid $UUID;
    chown -R ceph:ceph /var/lib/ceph/osd/ceph-$ID;
    systemctl enable ceph-osd@$ID;
    systemctl start ceph-osd@$ID;
  done


Once up, we imported previous exports of empty head files into the 'real' OSDs:
  kvm5b:
    systemctl stop ceph-osd@8;
    ceph-objectstore-tool --op import --pgid 7.4s0 --data-path /var/lib/ceph/osd/ceph-8 --journal-path /var/lib/ceph/osd/ceph-8/journal --file /var/lib/vz/template/ssd_recovery/osd8_7.4s0.export;
    chown ceph:ceph -R /var/lib/ceph/osd/ceph-8;
    systemctl start ceph-osd@8;
  kvm5f:
    systemctl stop ceph-osd@23;
    ceph-objectstore-tool --op import --pgid 7.fs0 --data-path /var/lib/ceph/osd/ceph-23 --journal-path /var/lib/ceph/osd/ceph-23/journal --file /var/lib/vz/template/ssd_recovery/osd23_7.fs0.export;
    chown ceph:ceph -R /var/lib/ceph/osd/ceph-23;
    systemctl start ceph-osd@23;


Bulk import previously exported objects:
cd /var/lib/vz/template/ssd_recovery;
for FILE in `ls -1A osd*_*.export | grep -Pv '^osd(8|23)_'`; do
  OSD=`echo $FILE | perl -pe 's/^osd(\d+).*/\1/'`;
  PGID=`echo $FILE | 

Re: [ceph-users] Running Jewel and Luminous mixed for a longer period

2017-12-31 Thread David Herselman
Hi Travis,

In my experience, after converting OSDs from hdd FileStore with ssd journals to 
hdd BlueStore with RocksDB and it's WAL on ssd, FileStore is significantly 
faster.

https://forum.proxmox.com/threads/ceph-bluestore-not-always-faster-than-filestore.38405/

Pure ssd OSDs would however be much faster using BlueStore...


Regards
David Herselman

On 29 Dec 2017 22:06, Travis Nielsen <travis.niel...@quantum.com> wrote:
Since bluestore was declared stable in Luminous, is there any remaining
scenario to use filestore in new deployments? Or is it safe to assume that
bluestore is always better to use in Luminous? All documentation I can
find points to bluestore being superior in all cases.

Thanks,
Travis



[ceph-users] Copy locked parent and clones to another pool

2017-12-24 Thread David Herselman
Pc "^\s+\d+\s+$2\s"` -gt 0 ] && return 0 || return 
1;
}

# Backup 'template-debian-9.3' (clone parent) - Should never change so no need to maintain snapshots or run it on a continual basis:
#for ID in 210; do
#  for f in 1 2 3; do
#    echo -en "\t\t : Copying "$src"/base-"$ID"-disk-"$f"@__base__ to "$dst"/vm-"$ID"-disk-"$f"_backup";
#    qemu-img convert -f raw -O raw -t unsafe -T unsafe -nWp -S 4M rbd:"$src"/base-"$ID"-disk-"$f"@__base__ rbd:"$dst"/vm-"$ID"-disk-"$f"_backup;
#  done
#done

# Backup images (clone children):
for ID in 211 212 213 214; do
  for f in 1 2 3; do
    rbdsnap "$dst"/vm-"$ID"-disk-"$f"_backup snap3 && rbdsnap "$dst"/vm-"$ID"-disk-"$f"_backup snap2 && rbd snap rm "$dst"/vm-"$ID"-disk-"$f"_backup@snap3;
    rbdsnap "$dst"/vm-"$ID"-disk-"$f"_backup snap3 || rbdsnap "$dst"/vm-"$ID"-disk-"$f"_backup snap2 && rbd snap rename "$dst"/vm-"$ID"-disk-"$f"_backup@snap2 "$dst"/vm-"$ID"-disk-"$f"_backup@snap3;
    rbdsnap "$dst"/vm-"$ID"-disk-"$f"_backup snap2 || rbdsnap "$dst"/vm-"$ID"-disk-"$f"_backup snap1 && rbd snap rename "$dst"/vm-"$ID"-disk-"$f"_backup@snap1 "$dst"/vm-"$ID"-disk-"$f"_backup@snap2;
    rbdsnap "$dst"/vm-"$ID"-disk-"$f"_backup snap1 || rbd snap create "$dst"/vm-"$ID"-disk-"$f"_backup@snap1;
    rbd snap create "$src"/vm-"$ID"-disk-"$f"@backupinprogress;
  done
  for f in 1 2 3; do
    echo -en "\t\t : Copying "$src"/vm-"$ID"-disk-"$f" to "$dst"/vm-"$ID"-disk-"$f"_backup";
    #qemu-img convert -f raw -O raw -t unsafe -T unsafe -nWp -S 4M rbd:"$src"/vm-"$ID"-disk-"$f"@backupinprogress rbd:"$dst"/vm-"$ID"-disk-"$f"_backup;
    export dev1=`rbd map "$src"/vm-"$ID"-disk-"$f@backupinprogress" --name client.admin -k /etc/pve/priv/ceph.client.admin.keyring`;
    export dev2=`rbd map "$dst"/vm-"$ID"-disk-"$f"_backup --name client.admin -k /etc/pve/priv/ceph.client.admin.keyring`;
    perl -'MDigest::MD5 md5' -ne 'BEGIN{$/=\4194304};print md5($_)' $dev2 |
      perl -'MDigest::MD5 md5' -ne 'BEGIN{$/=\4194304};$b=md5($_);
        read STDIN,$a,16;if ($a eq $b) {print "s"} else {print "c" . $_}' $dev1 |
      perl -ne 'BEGIN{$/=\1} if ($_ eq"s") {$s++} else {if ($s) {
        seek STDOUT,$s*4194304,1; $s=0}; read ARGV,$buf,4194304; print $buf}' 1<> $dev2;
    rbd unmap $dev1;
    rbd unmap $dev2;
    rbd snap rm "$src"/vm-"$ID"-disk-"$f"@backupinprogress;
  done
done



Commenting out everything from 'export dev1' to 'rbd unmap $dev2' and 
uncommenting the qemu-img command yields the following:
  real0m48.598s
  user0m14.583s
  sys 0m10.986s
[admin@kvm5a ~]# rbd du rbd_hdd/vm-211-disk-3_backup
NAME   PROVISIONED   USED
vm-211-disk-3_backup@snap3  20480M  2764M
vm-211-disk-3_backup@snap2  20480M  2764M
vm-211-disk-3_backup@snap1  20480M  2764M
vm-211-disk-3_backup20480M  2764M
 20480M 11056M


Repeating the copy using the Perl solution is much slower, but as the VM is 
currently off nothing has changed and each snapshot consumes zero data:
  real1m49.000s
  user1m34.339s
  sys 0m17.847s
[admin@kvm5a ~]# rbd du rbd_hdd/vm-211-disk-3_backup
warning: fast-diff map is not enabled for vm-211-disk-3_backup. operation may 
be slow.
NAME   PROVISIONED  USED
vm-211-disk-3_backup@snap3  20480M 2764M
vm-211-disk-3_backup@snap2  20480M 0
vm-211-disk-3_backup@snap1  20480M 0
vm-211-disk-3_backup20480M 0
 20480M 2764M


PS: Not sure if this is a Ceph display bug, but why would the snapshot base be 
reported as not consuming any data and the first snapshot (rotated to 'snap3') 
report all the usage? Purging all snapshots yields the following:
[admin@kvm5a ~]# rbd du rbd_hdd/vm-211-disk-3_backup
warning: fast-diff map is not enabled for vm-211-disk-3_backup. operation may 
be slow.
NAME PROVISIONED  USED
vm-211-disk-3_backup  20480M 2764M
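The 'fast-diff map is not enabled' warnings above can be avoided by enabling the 
relevant features on the backup image and rebuilding its object map; roughly 
(fast-diff depends on object-map, which in turn depends on exclusive-lock):

rbd feature enable rbd_hdd/vm-211-disk-3_backup exclusive-lock object-map fast-diff;
rbd object-map rebuild rbd_hdd/vm-211-disk-3_backup;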


Regards
David Herselman


Re: [ceph-users] Many concurrent drive failures - How do I activate pgs?

2017-12-21 Thread David Herselman
ceph-objectstore-tool --op import --pgid 7.4s5 --data-path /var/lib/ceph/osd/ceph-34 --journal-path /var/lib/ceph/osd/ceph-34/journal --file /var/lib/vz/template/ssd_recovery/osd34_7.4s5.export
systemctl stop ceph-osd@34
ceph-objectstore-tool --op import --pgid 7.fs1 --data-path /var/lib/ceph/osd/ceph-34 --journal-path /var/lib/ceph/osd/ceph-34/journal --file /var/lib/vz/template/ssd_recovery/osd34_7.fs1.export


Reset permissions and then started the OSDs:
for OSD in 27 30 31 32 34; do
  chown -R ceph:ceph /var/lib/ceph/osd/ceph-$OSD;
  systemctl start ceph-osd@$OSD;
done


Then finally started all the OSDs... Now to hope that Intel have a way of 
accessing drives that are in a 'disable logical state'.



The imports succeed; herewith a link to the output after running an import for 
placement group 7.4s2 on OSD 31: 
  https://drive.google.com/open?id=1-Jo1jmrWrGLO2OgflacGPlEf2p32Y4hn

Sample snippet:
Write 1#7:fffcd2ec:::rbd_data.4.be8e9974b0dc51.2869:head#
snapset 0=[]:{}
Write 1#7:fffd4823:::rbd_data.4.ba24ef2ae8944a.a2b0:head#
snapset 0=[]:{}
Write 1#7:fffd6fb6:::benchmark_data_kvm5b_20945_object14722:head#
snapset 0=[]:{}
Write 1#7:a069:::rbd_data.4.ba24ef2ae8944a.aea9:head#
snapset 0=[]:{}
Import successful


Data does get written; I can tell by the size of the FileStore mount points:
  [root@kvm5b ssd_recovery]# df -h | grep -P 'ceph-(27|30|31|32|34)$'
  /dev/sdd4   140G  5.2G  135G   4% /var/lib/ceph/osd/ceph-27
  /dev/sdd7   140G   14G  127G  10% /var/lib/ceph/osd/ceph-30
  /dev/sdd8   140G   14G  127G  10% /var/lib/ceph/osd/ceph-31
  /dev/sdd9   140G   22G  119G  16% /var/lib/ceph/osd/ceph-32
  /dev/sdd11  140G   22G  119G  16% /var/lib/ceph/osd/ceph-34


How do I tell Ceph to read these object shards?



PS: It's probably a good idea to reweight the OSDs to 0 before starting again. 
This should prevent data flowing onto them if they are not in a different 
device class or other CRUSH selection ruleset, i.e.:
  for OSD in `seq 24 35`; do
ceph osd crush reweight osd.$OSD 0;
  done


Regards
David Herselman

-Original Message-
From: David Herselman 
Sent: Thursday, 21 December 2017 3:49 AM
To: 'Christian Balzer' <ch...@gol.com>; ceph-users@lists.ceph.com
Subject: RE: [ceph-users] Many concurrent drive failures - How do I activate 
pgs?

Hi Christian,

Thanks for taking the time. I haven't been contacted by anyone yet but managed 
to get the down placement groups cleared by exporting 7.4s0 and 7.fs0 and then 
marking them as complete on the surviving OSDs:
kvm5c:
  ceph-objectstore-tool --op export --pgid 7.4s0 --data-path /var/lib/ceph/osd/ceph-8 --journal-path /var/lib/ceph/osd/ceph-8/journal --file /var/lib/vz/template/ssd_recovery/osd8_7.4s0.export;
  ceph-objectstore-tool --op mark-complete --data-path /var/lib/ceph/osd/ceph-8 --journal-path /var/lib/ceph/osd/ceph-8/journal --pgid 7.4s0;
kvm5f:
  ceph-objectstore-tool --op export --pgid 7.fs0 --data-path /var/lib/ceph/osd/ceph-23 --journal-path /var/lib/ceph/osd/ceph-23/journal --file /var/lib/vz/template/ssd_recovery/osd23_7.fs0.export;
  ceph-objectstore-tool --op mark-complete --data-path /var/lib/ceph/osd/ceph-23 --journal-path /var/lib/ceph/osd/ceph-23/journal --pgid 7.fs0;

This would presumably simply punch holes in the RBD images but at least we can 
copy them out of that pool and hope that Intel can somehow unlock the drives 
for us to then export/import objects.
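Copying the images out once the placement groups are serviceable is the easy 
part; a sketch of the two obvious approaches (image and pool names purely 
illustrative):

rbd cp rbd_ssd/vm-211-disk-1 rbd_hdd/vm-211-disk-1_rescue;
# or stream it, which also works across clusters:
rbd export rbd_ssd/vm-211-disk-1 - | rbd import - rbd_hdd/vm-211-disk-1_rescue;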


To answer your questions though, we have 6 near-identical Intel Wildcat Pass 1U 
servers with Proxmox loaded on them. Proxmox uses a Debian 9 base with the 
Ubuntu kernel, for which they apply cherry-picked kernel patches (e.g. Intel NIC 
driver updates, vhost perf regression and mem-leak fixes, etc.):

kvm5a:
   Intel R1208WTTGSR System (serial: BQWS55091014)
   Intel S2600WTTR Motherboard (serial: BQWL54950385, BIOS ID: SE5C610.86B.01.01.0021.032120170601)
   2 x Intel Xeon E5-2640v4 2.4GHz (HT disabled)
   24 x Micron 8GB DDR4 2133MHz (24 x 18ASF1G72PZ-2G1B1)
   Intel AXX10GBNIA I/O Module
kvm5b:
   Intel R1208WTTGS System (serial: BQWS53890178)
   Intel S2600WTT Motherboard (serial: BQWL52550359, BIOS ID: SE5C610.86B.01.01.0021.032120170601)
   2 x Intel Xeon E5-2640v4 2.4GHz (HT enabled)
   4 x Micron 64GB DDR4 2400MHz LR-DIMM (4 x 72ASS8G72LZ-2G3B2)
   Intel AXX10GBNIA I/O Module
kvm5c:
   Intel R1208WT2GS System (serial: BQWS50490279)
   Intel S2600WT2 Motherboard (serial: BQWL44650203, BIOS ID: SE5C610.86B.01.01.0021.032120170601)
   2 x Intel Xeon E5-2640v3 2.6GHz (HT enabled)
   4 x Micron 64GB DDR4 2400MHz LR-DIMM (4 x 72ASS8G72LZ-2G3B2)
   Intel AXX10GBNIA I/O Module
kvm5d:
   Intel R1208WTTGSR System (serial: BQWS62291318)
   Intel S2600WTTR Motherboard (serial: BQWL61855187, BIOS ID: SE5C610.86B.01.01.0021.032120170601)
   2 x In

Re: [ceph-users] Many concurrent drive failures - How do I activate pgs?

2017-12-20 Thread David Herselman
SanitizeCryptoScrambleSupported : True
SanitizeSupported : True
SataGen1 : True
SataGen2 : True
SataGen3 : True
SataNegotiatedSpeed : Unknown
SectorSize : 512
SecurityEnabled : False
SecurityFrozen : False
SecurityLocked : False
SecuritySupported : False
SerialNumber : PHYM7276031E1P9DGN
TCGSupported : False
TargetID : 0
TempThreshold : Selected drive is in a disable logical state.
TemperatureLoggingInterval : Selected drive is in a disable logical state.
TimeLimitedErrorRecovery : Selected drive is in a disable logical state.
TrimSize : 4
TrimSupported : True
VolatileWriteCacheEnabled : Selected drive is in a disable logical state.
WWID : 3959312879584368077
WriteAtomicityDisableNormal : Selected drive is in a disable logical state.
WriteCacheEnabled : True
WriteCacheReorderingStateEnabled : Selected drive is in a disable logical state.
WriteCacheState : Selected drive is in a disable logical state.
WriteCacheSupported : True
WriteErrorRecoveryTimer : Selected drive is in a disable logical state.



SMART information is inaccessible and the overall status is failed. Herewith the 
stats from a partner disk which was still working when the others failed:
Device Model: INTEL SSDSC2KG019T7
Serial Number:PHYM727602TM1P9DGN
LU WWN Device Id: 5 5cd2e4 14e1636bb
Firmware Version: SCV10100
User Capacity:1,920,383,410,176 bytes [1.92 TB]
Sector Sizes: 512 bytes logical, 4096 bytes physical
Rotation Rate:Solid State Device
Form Factor:  2.5 inches
Device is:Not in smartctl database [for details use: -P showall]
ATA Version is:   ACS-3 T13/2161-D revision 5
SATA Version is:  SATA 3.2, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:Mon Dec 18 19:33:51 2017 SAST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0032   100   100   000    Old_age   Always   -           0
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always   -           98
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always   -           3
170 Unknown_Attribute       0x0033   100   100   010    Pre-fail  Always   -           0
171 Unknown_Attribute       0x0032   100   100   000    Old_age   Always   -           1
172 Unknown_Attribute       0x0032   100   100   000    Old_age   Always   -           0
174 Unknown_Attribute       0x0032   100   100   000    Old_age   Always   -           0
175 Program_Fail_Count_Chip 0x0033   100   100   010    Pre-fail  Always   -           17567121432
183 Runtime_Bad_Block       0x0032   100   100   000    Old_age   Always   -           0
184 End-to-End_Error        0x0033   100   100   090    Pre-fail  Always   -           0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always   -           0
190 Airflow_Temperature_Cel 0x0022   077   076   000    Old_age   Always   -           23 (Min/Max 17/29)
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always   -           0
194 Temperature_Celsius     0x0022   100   100   000    Old_age   Always   -           23
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always   -           0
199 UDMA_CRC_Error_Count    0x003e   100   100   000    Old_age   Always   -           0
225 Unknown_SSD_Attribute   0x0032   100   100   000    Old_age   Always   -           14195
226 Unknown_SSD_Attribute   0x0032   100   100   000    Old_age   Always   -           0
227 Unknown_SSD_Attribute   0x0032   100   100   000    Old_age   Always   -           42
228 Power-off_Retract_Count 0x0032   100   100   000    Old_age   Always   -           5905
232 Available_Reservd_Space 0x0033   100   100   010    Pre-fail  Always   -           0
233 Media_Wearout_Indicator 0x0032   100   100   000    Old_age   Always   -           0
234 Unknown_Attribute       0x0032   100   100   000    Old_age   Always   -           0
241 Total_LBAs_Written      0x0032   100   100   000    Old_age   Always   -           14195
242 Total_LBAs_Read         0x0032   100   100   000    Old_age   Always   -           10422
243 Unknown_Attribute       0x0032   100   100   000    Old_age   Always   -           41906


Media wear out : 0% used
LBAs written: 14195
Power on hours: <100
Power cycle count: once at the factory, once at our offices to check if there 
was newer firmware (there wasn't) and once when we restarted the node to see if 
it could then access a failed drive.


Regards
David Herselman


-Original Message-
From: Christian Balzer [mailto:ch...@gol.com] 
Sent: Thursday, 21 December 2017 3:24 AM
To: ceph-users@lists.ceph.com
Cc: David Herselman <d...@syrex.co>
Subject: Re: [ceph-users] Many concurrent drive failures - How do I activate 
pgs?


Hello,

first off, I don't have anything to add to your conclusions of the current 
status, alas there are at least 2 folks here on the ML making a living from

[ceph-users] Many concurrent drive failures - How do I activate pgs?

2017-12-20 Thread David Herselman
OSDMAP_FLAGS noout flag(s) set
PG_AVAILABILITY Reduced data availability: 3 pgs inactive, 3 pgs down
pg 4.43 is down, acting [4,15,18]
pg 7.4 is down, acting [8,5,21,18,15,0]
pg 7.f is down, acting [23,0,16,5,11,14]
PG_DEGRADED Degraded data redundancy: 3 pgs unclean
    pg 4.43 is stuck unclean since forever, current state down, last acting [4,15,18]
    pg 7.4 is stuck unclean since forever, current state down, last acting [8,5,21,18,15,0]
    pg 7.f is stuck unclean since forever, current state down, last acting [23,0,16,5,11,14]

Original 'ceph pg X query' status (before we mucked around by exporting and 
deleting OSDs): https://pastebin.com/fBQhq6UQ
Current 'ceph pg X query' status (after recreating temporary OSDs with the 
original IDs and importing the exports): https://pastebin.com/qcN5uYkN


What we assume needs to be done (a sketch of the relevant commands follows this list):
  - Tell Ceph that the OSDs are lost (the query status in the pastebin above reports 
'starting or marking this osd lost may let us proceed'). We have already stopped 
the temporary OSDs, marked them as out and run 'ceph osd lost $ID 
--yes-i-really-mean-it', though.
  - Somehow get Ceph to forget about the sharded objects it doesn't have 
sufficient pieces of.
  - Copy the images to another pool so that we can get pieces of data off these 
and rebuild those systems.
  - Hopefully get Intel to unlock the drives, export as much of the content as 
possible and import the various exports so that we can ultimately copy off 
complete images.
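A sketch of the commands we believe cover the first two points (pg ID purely 
illustrative; we have not yet confirmed this is the right sequence for an 
erasure coded pool):

ceph osd lost $ID --yes-i-really-mean-it;
# tell Ceph to give up on objects it cannot recover in a given PG:
ceph pg 7.4 mark_unfound_lost delete;    # or 'revert' where a prior version exists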

Really, really hoping to have a Merry Christmas... ;)

PS: We got the 80 GB image out; it had a single 4 MB object hole, so we used 
ddrescue to read the source image forwards, rebooted the node when it stalled 
on the missing data, and repeated the copy in the reverse direction thereafter...


Regards
David Herselman