Re: [ceph-users] ceph cluster inconsistency keyvaluestore
Hmm, could you please list your instructions, including how long the cluster has existed and all relevant ops? I want to reproduce it.

On Mon, Sep 1, 2014 at 4:45 PM, Kenneth Waegeman wrote:
> Hi,
>
> I reinstalled the cluster with 0.84, and tried again running rados bench
> on an EC-coded pool on keyvaluestore.
> Nothing crashed this time, but when I check the status:
>
>      health HEALTH_ERR 128 pgs inconsistent; 128 scrub errors; too few pgs
>             per osd (15 < min 20)
>      monmap e1: 3 mons at {ceph001=10.141.8.180:6789/0,
>             ceph002=10.141.8.181:6789/0, ceph003=10.141.8.182:6789/0},
>             election epoch 8, quorum 0,1,2 ceph001,ceph002,ceph003
>      osdmap e174: 78 osds: 78 up, 78 in
>       pgmap v147680: 1216 pgs, 3 pools, 14758 GB data, 3690 kobjects
>             1753 GB used, 129 TB / 131 TB avail
>                 1088 active+clean
>                  128 active+clean+inconsistent
>
> The 128 inconsistent pgs are ALL the pgs of the EC KV store (the others
> are on Filestore).
>
> The only thing I can see in the logs is that after the rados tests, it
> starts scrubbing, and for each KV pg I get something like this:
>
> 2014-08-31 11:14:09.050747 osd.11 10.141.8.180:6833/61098 4 : [ERR] 2.3s0
> scrub stat mismatch, got 28164/29291 objects, 0/0 clones, 28164/29291
> dirty, 0/0 omap, 0/0 hit_set_archive, 0/0 whiteouts,
> 118128377856/122855358464 bytes.
>
> What could the problem be here?
> Thanks again!!
>
> Kenneth
>
>
> - Message from Haomai Wang -
>     Date: Tue, 26 Aug 2014 17:11:43 +0800
>     From: Haomai Wang
>  Subject: Re: [ceph-users] ceph cluster inconsistency?
>       To: Kenneth Waegeman
>       Cc: ceph-users@lists.ceph.com
>
>
>> Hmm, it looks like you hit this bug (http://tracker.ceph.com/issues/9223).
>>
>> Sorry for the late message, I forgot that this fix was merged into 0.84.
>>
>> Thanks for your patience :-)
>>
>> On Tue, Aug 26, 2014 at 4:39 PM, Kenneth Waegeman wrote:
>>>
>>> Hi,
>>>
>>> In the meantime I already tried upgrading the cluster to 0.84, to see
>>> if that made a difference, and it seems it does:
>>> I can't reproduce the crashing osds by doing a 'rados -p ecdata ls'
>>> anymore.
>>>
>>> But now the cluster detects it is inconsistent:
>>>
>>>     cluster 82766e04-585b-49a6-a0ac-c13d9ffd0a7d
>>>      health HEALTH_ERR 40 pgs inconsistent; 40 scrub errors; too few pgs
>>>             per osd (4 < min 20); mon.ceph002 low disk space
>>>      monmap e3: 3 mons at {ceph001=10.141.8.180:6789/0,
>>>             ceph002=10.141.8.181:6789/0, ceph003=10.141.8.182:6789/0},
>>>             election epoch 30, quorum 0,1,2 ceph001,ceph002,ceph003
>>>      mdsmap e78951: 1/1/1 up {0=ceph003.cubone.os=up:active}, 3 up:standby
>>>      osdmap e145384: 78 osds: 78 up, 78 in
>>>       pgmap v247095: 320 pgs, 4 pools, 15366 GB data, 3841 kobjects
>>>             1502 GB used, 129 TB / 131 TB avail
>>>                 279 active+clean
>>>                  40 active+clean+inconsistent
>>>                   1 active+clean+scrubbing+deep
>>>
>>> I tried to do ceph pg repair for all the inconsistent pgs:
>>>
>>>     cluster 82766e04-585b-49a6-a0ac-c13d9ffd0a7d
>>>      health HEALTH_ERR 40 pgs inconsistent; 1 pgs repair; 40 scrub errors;
>>>             too few pgs per osd (4 < min 20); mon.ceph002 low disk space
>>>      monmap e3: 3 mons at {ceph001=10.141.8.180:6789/0,
>>>             ceph002=10.141.8.181:6789/0, ceph003=10.141.8.182:6789/0},
>>>             election epoch 30, quorum 0,1,2 ceph001,ceph002,ceph003
>>>      mdsmap e79486: 1/1/1 up {0=ceph003.cubone.os=up:active}, 3 up:standby
>>>      osdmap e146452: 78 osds: 78 up, 78 in
>>>       pgmap v248520: 320 pgs, 4 pools, 15366 GB data, 3841 kobjects
>>>             1503 GB used, 129 TB / 131 TB avail
>>>                 279 active+clean
>>>                  39 active+clean+inconsistent
>>>                   1 active+clean+scrubbing+deep
>>>                   1 active+clean+scrubbing+deep+inconsistent+repair
>>>
>>> I let it recover through the night, but this morning the mons were all
>>> gone, nothing to see in the log files.. The osds were all still up!
>>>
>>>     cluster 82766e04-585b-49a6-a0ac-c13d9ffd0a7d
>>>      health HEALTH_ERR 36 pgs inconsistent; 1 pgs repair; 36 scrub errors;
>>>             too few pgs per osd (4 < min 20)
>>>      monmap e7: 3 mons at {ceph001=10.141.8.180:6789/0,
>>>             ceph002=10.141.8.181:6789/0, ceph003=10.141.8.182:6789/0},
>>>             election epoch 44, quorum 0,1,2 ceph001,ceph002,ceph003
>>>      mdsmap e109481: 1/1/1 up {0=ceph003.cubone.os=up:active}, 3 up:standby
>>>      osdmap e203410: 78 osds: 78 up, 78 in
>>>       pgmap v331747: 320 pgs, 4 pools, 15251 GB data, 3812 kobjects
>>>             1547 GB used, 129 TB / 131 TB avail
>>>                   1 active+clean+scrubbing+deep+inconsistent+repair
>>>                 284 active+clean
>>>                  35 active+clean+inconsistent
>>>
>>> I restarted the monitors now, I will let you know when I see something
>>> more..
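(A side note on the unrelated "too few pgs per osd" warning in these outputs: the number appears to be simply the total PG count divided by the number of OSDs, i.e. 1216 / 78 ≈ 15 in the current status and 320 / 78 ≈ 4 in the older one quoted above, both below the default minimum of 20.)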
Re: [ceph-users] ceph cluster inconsistency keyvaluestore
Hi,

The cluster got installed with quattor, which uses ceph-deploy for the installation of daemons, writes the config file and installs the crushmap.
I have 3 hosts with 12 disks each; each disk has a large KV partition (3.6T) for the ECdata pool and a small cache partition (50G) for the cache pool.

I manually did this:

ceph osd pool create cache 1024 1024
ceph osd pool set cache size 2
ceph osd pool set cache min_size 1
ceph osd erasure-code-profile set profile11 k=8 m=3 ruleset-failure-domain=osd
ceph osd pool create ecdata 128 128 erasure profile11
ceph osd tier add ecdata cache
ceph osd tier cache-mode cache writeback
ceph osd tier set-overlay ecdata cache
ceph osd pool set cache hit_set_type bloom
ceph osd pool set cache hit_set_count 1
ceph osd pool set cache hit_set_period 3600
ceph osd pool set cache target_max_bytes $((280*1024*1024*1024))

(But the previous time I had the problem already without the cache part.)

Cluster live since 2014-08-29 15:34:16

Config file on host ceph001:

[global]
auth_client_required = cephx
auth_cluster_required = cephx
auth_service_required = cephx
cluster_network = 10.143.8.0/24
filestore_xattr_use_omap = 1
fsid = 82766e04-585b-49a6-a0ac-c13d9ffd0a7d
mon_cluster_log_to_syslog = 1
mon_host = ceph001.cubone.os, ceph002.cubone.os, ceph003.cubone.os
mon_initial_members = ceph001, ceph002, ceph003
osd_crush_update_on_start = 0
osd_journal_size = 10240
osd_pool_default_min_size = 2
osd_pool_default_pg_num = 512
osd_pool_default_pgp_num = 512
osd_pool_default_size = 3
public_network = 10.141.8.0/24

[osd.11]
osd_objectstore = keyvaluestore-dev

[osd.13]
osd_objectstore = keyvaluestore-dev

[osd.15]
osd_objectstore = keyvaluestore-dev

[osd.17]
osd_objectstore = keyvaluestore-dev

[osd.19]
osd_objectstore = keyvaluestore-dev

[osd.21]
osd_objectstore = keyvaluestore-dev

[osd.23]
osd_objectstore = keyvaluestore-dev

[osd.25]
osd_objectstore = keyvaluestore-dev

[osd.3]
osd_objectstore = keyvaluestore-dev

[osd.5]
osd_objectstore = keyvaluestore-dev

[osd.7]
osd_objectstore = keyvaluestore-dev

[osd.9]
osd_objectstore = keyvaluestore-dev

OSDs:

# id    weight  type name              up/down reweight
-12     140.6   root default-cache
-9      46.87           host ceph001-cache
2       3.906                   osd.2   up      1
4       3.906                   osd.4   up      1
6       3.906                   osd.6   up      1
8       3.906                   osd.8   up      1
10      3.906                   osd.10  up      1
12      3.906                   osd.12  up      1
14      3.906                   osd.14  up      1
16      3.906                   osd.16  up      1
18      3.906                   osd.18  up      1
20      3.906                   osd.20  up      1
22      3.906                   osd.22  up      1
24      3.906                   osd.24  up      1
-10     46.87           host ceph002-cache
28      3.906                   osd.28  up      1
30      3.906                   osd.30  up      1
32      3.906                   osd.32  up      1
34      3.906                   osd.34  up      1
36      3.906                   osd.36  up      1
38      3.906                   osd.38  up      1
40      3.906                   osd.40  up      1
42      3.906                   osd.42  up      1
44      3.906                   osd.44  up      1
46      3.906                   osd.46  up      1
48      3.906                   osd.48  up      1
50      3.906                   osd.50  up      1
-11     46.87           host ceph003-cache
54      3.906                   osd.54  up      1
56      3.906                   osd.56  up      1
58      3.906                   osd.58  up      1
60      3.906                   osd.60  up      1
62      3.906                   osd.62  up      1
64      3.906                   osd.64  up      1
66      3.906                   osd.66  up      1
68      3.906                   osd.68  up      1
70      3.906                   osd.70  up      1
72      3.906                   osd.72  up      1
74      3.906                   osd.74  up      1
76      3.906                   osd.76  up      1
-8      140.6   root default-ec
-5      46.87           host ceph001-ec
3       3.906                   osd.3   up      1
5       3.906                   osd.5   up      1
7       3.906                   osd.7   up      1
9       3.906                   osd.9   up      1
11      3.906                   osd.11  up      1
13      3.906                   osd.13  up      1
15      3.906                   osd.15  up      1
17      3.906                   osd.17  up      1
19      3.906                   osd.19  up      1
21      3.906                   osd.21  up      1
23      3.906                   osd.23  up      1
25      3.906                   osd.25  up      1
-6      46.87           host ceph002-ec
29      3.906                   osd.29  up      1
31      3.906                   osd.31  up      1
33      3.906                   osd.33  up      1
35      3.906
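A short note on what the profile11 erasure-code profile above implies, assuming the rados bench default object size of 4 MB: with k=8 and m=3, each object is cut into 8 data chunks of 512 KB plus 3 coding chunks of 512 KB, i.e. 5.5 MB of raw chunks per 4 MB of logical data (1.375x overhead); the pool survives the loss of any 3 chunks, and with ruleset-failure-domain=osd the 11 chunks of each PG are placed on 11 different OSDs.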
Re: [ceph-users] ceph cluster inconsistency keyvaluestore
I can also reproduce it on a new, slightly different setup (also EC on KV and cache) by running ceph pg scrub on a KV pg: that pg will then get the 'inconsistent' status.

- Message from Kenneth Waegeman -
    Date: Mon, 01 Sep 2014 16:28:31 +0200
    From: Kenneth Waegeman
 Subject: Re: ceph cluster inconsistency keyvaluestore
      To: Haomai Wang
      Cc: ceph-users@lists.ceph.com

Hi,

The cluster got installed with quattor, which uses ceph-deploy for the
installation of daemons, writes the config file and installs the crushmap.

[... rest of the setup description (pool commands, config file and OSD tree) snipped; see the previous message ...]
Re: [ceph-users] ceph cluster inconsistency keyvaluestore
Sorry for the late message, I'm back from a short vacation. I would like to try it this weekend. Thanks for your patience :-)

On Wed, Sep 3, 2014 at 9:16 PM, Kenneth Waegeman wrote:
> I can also reproduce it on a new, slightly different setup (also EC on KV
> and cache) by running ceph pg scrub on a KV pg: that pg will then get the
> 'inconsistent' status
>
> [...]
Re: [ceph-users] ceph cluster inconsistency keyvaluestore
I have found the root cause. It's a bug.

When a chunky scrub happens, it iterates over the whole pg's objects, and each iteration only scans a few objects:

osd/PG.cc:3758
        ret = get_pgbackend()->objects_list_partial(
          start,
          cct->_conf->osd_scrub_chunk_min,
          cct->_conf->osd_scrub_chunk_max,
          0,
          &objects,
          &candidate_end);

candidate_end is the end of the object set and is used to indicate the start position of the next scrub chunk. But it gets truncated:

osd/PG.cc:3777
        while (!boundary_found && objects.size() > 1) {
          hobject_t end = objects.back().get_boundary();
          objects.pop_back();

          if (objects.back().get_filestore_key() != end.get_filestore_key()) {
            candidate_end = end;
            boundary_found = true;
          }
        }

"end", an hobject_t that only contains the "hash" field, is assigned to candidate_end. So on the next scrub chunk, an hobject_t that only contains the "hash" field is passed in to get_pgbackend()->objects_list_partial.

This produces incorrect results for the KeyValueStore backend, because it uses strict key ordering for the "collection_list_partial" method. An hobject_t that only contains the "hash" field encodes to:

1%e79s0_head!972F1B5D!!none!!!!0!0

while the actual object is:

1%e79s0_head!972F1B5D!!1!!!object-name!head

In other words, an hobject_t that only contains the "hash" field cannot be used to look up an actual object with the same "hash".

@sage The simple way is to modify the obj->key function, which will change the storage format. Because it's an experimental backend, I would like to provide an external format-conversion program to help users do it. Is that OK?

On Wed, Sep 3, 2014 at 9:16 PM, Kenneth Waegeman wrote:
> I can also reproduce it on a new, slightly different setup (also EC on KV
> and cache) by running ceph pg scrub on a KV pg: that pg will then get the
> 'inconsistent' status
>
> [...]
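To make the ordering problem above concrete, here is a minimal standalone C++ sketch (not Ceph code; a std::map merely stands in for the KeyValueStore's strictly ordered key space, and the two keys are the ones quoted above). The hash-only boundary key sorts after the real object key, so a listing that resumes at the boundary never sees the object again:

    #include <iostream>
    #include <map>
    #include <string>

    int main() {
      // Strictly ordered key space, like the KeyValueStore backend.
      std::map<std::string, int> keys;

      // Key of a real object stored by the backend (hash 972F1B5D).
      keys["1%e79s0_head!972F1B5D!!1!!!object-name!head"] = 1;

      // Key encoded from a hash-only hobject_t, as produced by get_boundary().
      std::string boundary = "1%e79s0_head!972F1B5D!!none!!!!0!0";

      // '1' < 'n' in ASCII, so the real object sorts BEFORE the boundary key.
      std::cout << std::boolalpha
                << (keys.begin()->first < boundary) << "\n";  // prints: true

      // Resuming the next scrub chunk at the boundary therefore skips it.
      auto it = keys.lower_bound(boundary);
      std::cout << (it == keys.end()) << "\n";                // prints: true
      return 0;
    }

With the ordering the scrub code expects, a hash-only boundary would compare no greater than any object sharing its hash, and the next chunk's listing would pick the object up.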
Re: [ceph-users] ceph cluster inconsistency keyvaluestore
Thank you very much!

Is this problem then related to the weird sizes I see:

     pgmap v55220: 1216 pgs, 3 pools, 3406 GB data, 852 kobjects
           418 GB used, 88130 GB / 88549 GB avail

A calculation with df shows indeed that there is about 400GB used on the disks, but the tests I ran should have generated 3.5 TB, as also seen in rados df:

pool name       category        KB      objects clones  degraded unfound rd      rd KB   wr      wr KB
cache           -               59150443154660 0 0 1388365 5686734850 3665984 4709621763
ecdata          -               3512807425 8576200 0 0 1109938312332288 857621 3512807426

I thought it was related to the inconsistency? Or can this be a sparse objects thing? (But I couldn't find anything in the docs about that.)

Thanks again!

Kenneth

- Message from Haomai Wang -
    Date: Sun, 7 Sep 2014 20:34:39 +0800
    From: Haomai Wang
 Subject: Re: ceph cluster inconsistency keyvaluestore
      To: Kenneth Waegeman
      Cc: ceph-users@lists.ceph.com

I have found the root cause. It's a bug.

[...]
Re: [ceph-users] ceph cluster inconsistency keyvaluestore
I'm not very sure, but it's possible that keyvaluestore uses sparse writes, which makes a big difference to the ceph space statistics.

On Mon, Sep 8, 2014 at 6:35 PM, Kenneth Waegeman wrote:
>
> Thank you very much!
>
> Is this problem then related to the weird sizes I see:
>      pgmap v55220: 1216 pgs, 3 pools, 3406 GB data, 852 kobjects
>            418 GB used, 88130 GB / 88549 GB avail
>
> A calculation with df shows indeed that there is about 400GB used on the
> disks, but the tests I ran should have generated 3.5 TB, as also seen in
> rados df
>
> [...]
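As a minimal illustration of the sparse-allocation effect described above, here is a small POSIX sketch that has nothing to do with KeyValueStore internals (the file path is made up for the example): the logical size of what was written can be far larger than what is actually allocated on disk, which is the same kind of gap as between the "GB data" and "GB used" figures quoted above.

    #include <fcntl.h>
    #include <sys/stat.h>
    #include <unistd.h>
    #include <cstdio>

    int main() {
      // Hypothetical scratch file used only for this demonstration.
      const char *path = "/tmp/sparse-demo";
      int fd = open(path, O_CREAT | O_TRUNC | O_WRONLY, 0644);
      if (fd < 0) return 1;

      // Seek 1 GiB into the file and write a single byte; the hole in
      // between is never allocated on disk.
      lseek(fd, 1024L * 1024 * 1024, SEEK_SET);
      if (write(fd, "x", 1) != 1) return 1;
      close(fd);

      struct stat st;
      if (stat(path, &st) != 0) return 1;
      printf("logical size : %lld bytes\n", (long long)st.st_size);         // about 1 GiB
      printf("allocated    : %lld bytes\n", (long long)st.st_blocks * 512); // a few KiB
      return 0;
    }

Whether the KeyValueStore backend really accounts for space this way here is, as said above, not certain.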
Re: [ceph-users] ceph cluster inconsistency keyvaluestore
On Sun, 7 Sep 2014, Haomai Wang wrote:
> I have found the root cause. It's a bug.
>
> [...]
>
> In other words, an hobject_t that only contains the "hash" field cannot
> be used to look up an actual object with the same "hash".

You mean the problem is that the sort order is wrong and the hash-only hobject_t key doesn't sort before the other objects, right?

> @sage The simple way is to modify the obj->key function, which will
> change the storage format. Because it's an experimental backend, I would
> like to provide an external format-conversion program to help users do
> it. Is that OK?

Yeah, I think it's okay to just go ahead and make an incompatible change. If it is easy to do an upgrade converter, it might be worthwhile, but this is an experimental backend so you are certainly not required to. :)

sage
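Purely as an illustration of the property being discussed (this is not the actual fix, and the alternative encoding below is made up for the example): if the obj->key encoding made a hash-only boundary compare less than every real key sharing its hash, the strict string ordering used by collection_list_partial would already put the boundary where scrub expects it.

    #include <cassert>
    #include <string>

    int main() {
      // Key of a real object and the current encoding of a hash-only boundary.
      std::string object_key   = "1%e79s0_head!972F1B5D!!1!!!object-name!head";
      std::string boundary_now = "1%e79s0_head!972F1B5D!!none!!!!0!0";

      // Hypothetical alternative: stop the boundary key right after the hash,
      // so nothing with that hash can sort before it.
      std::string boundary_alt = "1%e79s0_head!972F1B5D!!";

      assert(object_key < boundary_now);  // today: boundary sorts AFTER the object
      assert(boundary_alt < object_key);  // alternative: boundary sorts BEFORE it
      return 0;
    }

Any real change of this kind alters the on-disk key format, which is why a conversion step for existing stores comes up above.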